Selection: Simple#

Understanding Selection in TidyChef#

Selection is the process of identifying parts of your spreadsheet that you want to work with—such as specific cells, rows, columns, or blocks of data. Because visually structured tables often contain headers, footnotes, and irregular layouts, simply reading the whole sheet won’t produce tidy data.

TidyChef uses Selectables — programmable, composable selectors—to precisely target these meaningful data elements. Mastering selection lets you extract and reshape only the relevant information, forming the foundation for clean, analysis-ready datasets.

Source Data#

The data source we’re using for these examples is shown below:

Note - this particular table has some very verbose headers we don’t care about, so we’ll be using bounded= to remove them from the previews as well as to show just the subset of data we’re working with.

The full data source can be downloaded here. We’ll be using th 8th tab named “Table 3a”.

from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3a")
preview(table, bounded="A4:H10")

Table 3a

ABCDEFGH
4Percentage change period on period
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5JMV5KMV5LMV5M
7Quarterly dataset identifier codeMV54MV55MVM7MV56MV57MV58MV59
8Monthly dataset identifier codeMV4OMV4PMVM2MV4QMV4RMV4SMV4T
91998-191-1.2-2.85.41.98.4
101999-13.3-10-10.3-2.812.63.212

Row(s), Column(s) & Ranges#

These are easier to show that tell, so we’re going to give quick examples of each of these.

Selecting a row#

Lets just select a row based on its row mumber.

row3 = table.row("7").label_as("Just row 7")
preview(row3, bounded="A5:H8")
Just row 7

Table 3a

ABCDEFGH
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5JMV5KMV5LMV5M
7Quarterly dataset identifier codeMV54MV55MVM7MV56MV57MV58MV59
8Monthly dataset identifier codeMV4OMV4PMVM2MV4QMV4RMV4SMV4T

Selecting a row conditionally#

We know we want one row and its a row with at least one cell with “Time period” in it.

row_with_time_period = table.row_containing_strings(["Time period"]).label_as("A rows with 'Time period' in it")
preview(row_with_time_period, bounded="A4:H6")
A rows with 'Time period' in it

Table 3a

ABCDEFGH
4Percentage change period on period
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5JMV5KMV5LMV5M

Note - notice its strings plural not string singular, this allows us to be extra specific if we need to, example:

table.row_containing_strings(["Foo", "Bar", "Baz"])

This would still only return exactly one row, but only where said row contained a call ppulated with the value “Foo” and another call populated with the value “Bar” and another one populated with the word “Baz”.

Selecting a column#

column_c = table.column("C").label_as("I am column C")
preview(column_c, bounded="B2:D10")
I am column C

Table 3a

BCD
2
3
4
5Public new housingPrivate new housingTotal new housing
6MV5HMV5IMVM3
7MV54MV55MVM7
8MV4OMV4PMVM2
9-191-1.2
10-13.3-10-10.3

Selecting a column conditionally#

column = table.column_containing_strings(["Private new housing"])
preview(column_c, bounded="B2:D10") 
I am column C

Table 3a

BCD
2
3
4
5Public new housingPrivate new housingTotal new housing
6MV5HMV5IMVM3
7MV54MV55MVM7
8MV4OMV4PMVM2
9-191-1.2
10-13.3-10-10.3

There’s a point worth understanding in this example, the latter (find by expected contents) is typically more robust than the formcer (explciit reference) as its a good chance of contiuing to work even fi the layout shifts.

Selecting a single cell conditionally#

Very much what is sounds like, in this case you only pass one string because you’re only even looking for one match.

cell = table.cell_containing_string("MV5I").label_as("Single cell selection")
preview(cell, bounded="B4:D8") 
Single cell selection

Table 3a

BCD
4
5Public new housingPrivate new housingTotal new housing
6MV5HMV5IMVM3
7MV54MV55MVM7
8MV4OMV4PMVM2

Understanding expand_to_box()#

For virtual all tabulated datasource the observations (the actual values) appear in a box or rectangular, shape - as shown below

preview(table.excel_ref("B9:E12").label_as("Box like selection of value cells"), bounded="A5:E12")
Box like selection of value cells

Table 3a

ABCDE
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5J
7Quarterly dataset identifier codeMV54MV55MVM7MV56
8Monthly dataset identifier codeMV4OMV4PMVM2MV4Q
91998-191-1.2-2.8
101999-13.3-10-10.3-2.8
11200025.511.712.9-6.2
1220012.3-6.7-5.87.1

    1. The way you instinctively want to read the above blue selection is the same way all human beings do,starting from the top left of the blue box your eyes moves left to right, top to bottom - i.e the same way you’d read a book.

    1. .expand_to_box() mirrors this quirk of how we read, so it:

      • Starts with a single cell (the top-left part of the box)

      • Sweeps right gathering all the cells

      • Sweeps down gathering all cells beneath those cells.

      • Removes any cells that are blank

example follows:

start_cell = table.excel_ref("B9")
box_of_cells = start_cell.expand_to_box().label_as("Box like selection of value cells")
preview(box_of_cells, bounded="A5:E12")
Box like selection of value cells

Table 3a

ABCDE
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5J
7Quarterly dataset identifier codeMV54MV55MVM7MV56
8Monthly dataset identifier codeMV4OMV4PMVM2MV4Q
91998-191-1.2-2.8
101999-13.3-10-10.3-2.8
11200025.511.712.9-6.2
1220012.3-6.7-5.87.1

…wait…. can’t we just use that handy excel_ref() thing that keeps cropping up?

Yes you 100% can ….. but…. that’s an explicit reference, so your code breaks the second someone moves whatever you’re targetting so there’s a trade off to consider.

Excel ref is convenient and very useful thing you’ll almost certainly be making use of, but it (and all explicit references) carry an inherent fragility. Decided where to be explicit vs conditional with your selections is very context dependent but always something you should be considering when working with tidychef.

Speaking of…..

Excel ref#

This is analogous to when you click and drag on a spreadhseet to select a rand of cells, examples follow:

preview(table.excel_ref("B9"), bounded="A5:E12")
Unnamed Selection: 0

Table 3a

ABCDE
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5J
7Quarterly dataset identifier codeMV54MV55MVM7MV56
8Monthly dataset identifier codeMV4OMV4PMVM2MV4Q
91998-191-1.2-2.8
101999-13.3-10-10.3-2.8
11200025.511.712.9-6.2
1220012.3-6.7-5.87.1

preview(table.excel_ref("B9:D11"), bounded="A5:E12")
Unnamed Selection: 0

Table 3a

ABCDE
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5J
7Quarterly dataset identifier codeMV54MV55MVM7MV56
8Monthly dataset identifier codeMV4OMV4PMVM2MV4Q
91998-191-1.2-2.8
101999-13.3-10-10.3-2.8
11200025.511.712.9-6.2
1220012.3-6.7-5.87.1

preview(table.excel_ref("8:10"), bounded="A5:E12")
Unnamed Selection: 0

Table 3a

ABCDE
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5J
7Quarterly dataset identifier codeMV54MV55MVM7MV56
8Monthly dataset identifier codeMV4OMV4PMVM2MV4Q
91998-191-1.2-2.8
101999-13.3-10-10.3-2.8
11200025.511.712.9-6.2
1220012.3-6.7-5.87.1

preview(table.excel_ref("D:E"), bounded="A5:E12")
Unnamed Selection: 0

Table 3a

ABCDE
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5J
7Quarterly dataset identifier codeMV54MV55MVM7MV56
8Monthly dataset identifier codeMV4OMV4PMVM2MV4Q
91998-191-1.2-2.8
101999-13.3-10-10.3-2.8
11200025.511.712.9-6.2
1220012.3-6.7-5.87.1

Is Numeric and Is Not Numeric#

These common methods are convenient aliases two very popular filters (more on filters later) and they work largely as you’d image.

numeric = table.is_numeric().label_as("Is numeric")
preview(numeric, bounded="A5:E12")
Is numeric

Table 3a

ABCDE
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5J
7Quarterly dataset identifier codeMV54MV55MVM7MV56
8Monthly dataset identifier codeMV4OMV4PMVM2MV4Q
91998-191-1.2-2.8
101999-13.3-10-10.3-2.8
11200025.511.712.9-6.2
1220012.3-6.7-5.87.1

not_numeric = table.is_not_numeric().label_as("In not numeric")
preview(not_numeric, bounded="A5:E12")
In not numeric

Table 3a

ABCDE
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5J
7Quarterly dataset identifier codeMV54MV55MVM7MV56
8Monthly dataset identifier codeMV4OMV4PMVM2MV4Q
91998-191-1.2-2.8
101999-13.3-10-10.3-2.8
11200025.511.712.9-6.2
1220012.3-6.7-5.87.1