Selection: Regex

Selection: Regex#

No python data wrangling framework would be complete without a nod to regular expressions.

The full syntax of regex is beyond the scope of this tutorial (but you can learn more here or via many online sources) and it’s included as a precaution, you should rarely if ever find you need to dip into regex.

This page is just some simple example on how to use regex in relation to tidychef cells.

IMPORTANT - regex is powerful …but… it’s also quite slow, as a general rule always look to use the other methods whenever they’ll do the same job.

Source Data#

The data source we’re using for these examples is shown below:

Note - this particular table has some very verbose headers we don’t care about, so we’ll be using bounded= to remove them from the previews as well as to show just the subset of data we’re working with.

The full data source can be downloaded here. We’ll be using th 10th tab named “Table 3c”.

from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")
preview(table, bounded="A4:H10")

Table 3c

ABCDEFGH
4Percentage change 3 months on previous 3 months
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Dataset identifier codeMVO6MVO7MVO8MVO9MVP2MVP3MVP4
7Jun 20105.69.88.834.33.71.9
8Jul 201025.64.80.2-0.29.73.5
9Aug 20105.54.54.7-2.9-2.924.45.9
10Sep 201011.77.58.5-6.8-3.316.15.3

Simple Regex Examples#

The following are simple examples of how to use regex with tidychef selections.

Note, for brevity we use the common shorthand re for regex.

from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")

# cells beginning with a capital M
m_cells = table.re("M.*").label_as("Cells starting with a capital M")

# cells containing the word "housing"
housing = table.re(".*housing.*").label_as("Cells containing the word housing")

# cells containing the word "work"
work = table.re(".*work.*").label_as("Cells containing the word work")

# cells ending in a year
year = table.re(".*[0-9][0-9][0-9][0-9]").label_as("Cells ending in a year")

preview(m_cells, housing, work, year, bounded="A4:H10")
Cells starting with a capital M
Cells containing the word housing
Cells containing the word work
Cells ending in a year

Table 3c

ABCDEFGH
4Percentage change 3 months on previous 3 months
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Dataset identifier codeMVO6MVO7MVO8MVO9MVP2MVP3MVP4
7Jun 20105.69.88.834.33.71.9
8Jul 201025.64.80.2-0.29.73.5
9Aug 20105.54.54.7-2.9-2.924.45.9
10Sep 201011.77.58.5-6.8-3.316.15.3