Iterating Tables/Sheets

Iterating Tables/Sheets#

Here we’ll talk about working with iterable data sources - such as extracting and joining data taken from multiple tables from a single spreadheet.

Source Data#

The data source we’re using for these examples is shown below:

Note - this particular table has some very verbose headers we don’t care about, so we’ll be using bounded= to remove them from the previews as well as to show just the subset of data we’re working with.

The full data source can be downloaded here.

For this example we’ll be using the following tables:

  • The 4th table named “Table 1a”.

  • The 5th table named “Table 1b”.

The principle difference between the tables is 1a is “seasonally adjusted” and 1b is not.

For the sake of practicality we’ll only be extracting observations down to row 12.

from typing import List
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 1a|Table 1b")
preview(tables[0], bounded="A1:H12")
preview(tables[1], bounded="A1:H12")

Table 1a

ABCDEFGH
1Table 1a: Construction output in Great Britain, volume, seasonally adjusted, index numbers, by sector
2This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance.
3Source: Construction Output and Employment from the Office for National Statistics
42019=100
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Dataset identifier codeMV36MV37MVL7MV38MV39MV3AMV3B
7199730.844.842.661.257.6152.184.3
8199824.945.34259.560.715591.4
9199921.640.737.757.968.3159.9102.3
10200027.145.542.654.364.7142.7103.1
11200127.742.540.158.165.3145.8102.3
12200231.446.343.965.782.6115.7105.7

Table 1b

ABCDEFGH
1Table 1b: Construction output in Great Britain, volume, non-seasonally adjusted, index numbers, by sector
2This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance.
3Source: Construction Output and Employment from the Office for National Statistics
42019=100
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Dataset identifier codeMV3JMV3KMVL8MV3LMV3MMV3NMV3O
7199730.745.543.360.756.7149.882.4
8199824.84642.85959.7152.589.3
9199921.641.538.557.667.6158.1100.4
10200027.146.343.453.963.8140.7100.9
11200127.743.240.957.764.4143.8100.2
12200231.44744.765.281.4113.9103.4

An Iterated Extraction#

In this example we’re going to

  • Iterate through the sheets

  • Extract data from the two sheets in question - adding a column to indicate whether the data is seasonally adjusted.

  • Join the data into a single TidyData putput.

from tidychef import acquire, preview
from tidychef.direction import right, down
from tidychef.output import Column, TidyData
from tidychef.selection import XlsxSelectable

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 1a|Table 1b")

# An empty list to hold our tables
tidy_data_list = []

# Now iterate and extract
for table in tables:
    anchor = table.excel_ref("A5").label_as("Anchor Cell")
    observations = table.excel_ref("B7:H12").label_as("Observations")
    housing = anchor.fill(right).label_as("Housing")
    dataset_identifier_codes = housing.shift(down).label_as("Data Identifier Codes")
    period = anchor.shift(down(2)).expand(down).label_as("Period")

    # We're not gonna set a variable based on the contents of cell A1
    # this is what tells us if its SA of NSA
    a1_cell_value = table.excel_ref("A1").lone_value()
    is_seasonally_adjusted = "False" if "non-season" in a1_cell_value else "True"

    # Preview selections to sanity check
    # we'll include the anchor cell
    preview(anchor, observations, housing, dataset_identifier_codes, period, bounded="A1:H12")

    tidy_data = TidyData(
        observations,
        Column(housing.attach_directly(down)),
        Column(dataset_identifier_codes.attach_directly(down)),
        Column(period.attach_directly(right)),
        Column.constant("Seasonally Adjusted", is_seasonally_adjusted)
    )
    
    # Now append the tidy data for this sheet to our list
    tidy_data_list.append(tidy_data)
    

# concatenate the list and print our new output
all_tidy_data = TidyData.from_tidy_list(tidy_data_list)
print(all_tidy_data)
Anchor Cell
Observations
Housing
Data Identifier Codes
Period

Table 1a

ABCDEFGH
1Table 1a: Construction output in Great Britain, volume, seasonally adjusted, index numbers, by sector
2This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance.
3Source: Construction Output and Employment from the Office for National Statistics
42019=100
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Dataset identifier codeMV36MV37MVL7MV38MV39MV3AMV3B
7199730.844.842.661.257.6152.184.3
8199824.945.34259.560.715591.4
9199921.640.737.757.968.3159.9102.3
10200027.145.542.654.364.7142.7103.1
11200127.742.540.158.165.3145.8102.3
12200231.446.343.965.782.6115.7105.7

Anchor Cell
Observations
Housing
Data Identifier Codes
Period

Table 1b

ABCDEFGH
1Table 1b: Construction output in Great Britain, volume, non-seasonally adjusted, index numbers, by sector
2This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance.
3Source: Construction Output and Employment from the Office for National Statistics
42019=100
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Dataset identifier codeMV3JMV3KMVL8MV3LMV3MMV3NMV3O
7199730.745.543.360.756.7149.882.4
8199824.84642.85959.7152.589.3
9199921.641.538.557.667.6158.1100.4
10200027.146.343.453.963.8140.7100.9
11200127.743.240.957.764.4143.8100.2
12200231.44744.765.281.4113.9103.4

ObservationsHousingData Identifier CodesPeriodSeasonally Adjusted
30.8Public new housingMV361997True
44.8Private new housingMV371997True
42.6Total new housingMVL71997True
61.2Infrastructure new workMV381997True
57.6Public other new workMV391997True
152.1Private industrial new workMV3A1997True
84.3Private commercial new workMV3B1997True
24.9Public new housingMV361998True
45.3Private new housingMV371998True
42Total new housingMVL71998True
59.5Infrastructure new workMV381998True
60.7Public other new workMV391998True
155Private industrial new workMV3A1998True
91.4Private commercial new workMV3B1998True
21.6Public new housingMV361999True
40.7Private new housingMV371999True
37.7Total new housingMVL71999True
57.9Infrastructure new workMV381999True
68.3Public other new workMV391999True
159.9Private industrial new workMV3A1999True
102.3Private commercial new workMV3B1999True
27.1Public new housingMV362000True
45.5Private new housingMV372000True
42.6Total new housingMVL72000True
54.3Infrastructure new workMV382000True
64.7Public other new workMV392000True
142.7Private industrial new workMV3A2000True
103.1Private commercial new workMV3B2000True
27.7Public new housingMV362001True
42.5Private new housingMV372001True
40.1Total new housingMVL72001True
58.1Infrastructure new workMV382001True
65.3Public other new workMV392001True
145.8Private industrial new workMV3A2001True
102.3Private commercial new workMV3B2001True
31.4Public new housingMV362002True
46.3Private new housingMV372002True
43.9Total new housingMVL72002True
65.7Infrastructure new workMV382002True
82.6Public other new workMV392002True
115.7Private industrial new workMV3A2002True
105.7Private commercial new workMV3B2002True
30.7Public new housingMV3J1997False
45.5Private new housingMV3K1997False
43.3Total new housingMVL81997False
60.7Infrastructure new workMV3L1997False
56.7Public other new workMV3M1997False
149.8Private industrial new workMV3N1997False
82.4Private commercial new workMV3O1997False
24.8Public new housingMV3J1998False
46Private new housingMV3K1998False
42.8Total new housingMVL81998False
59Infrastructure new workMV3L1998False
59.7Public other new workMV3M1998False
152.5Private industrial new workMV3N1998False
89.3Private commercial new workMV3O1998False
21.6Public new housingMV3J1999False
41.5Private new housingMV3K1999False
38.5Total new housingMVL81999False
57.6Infrastructure new workMV3L1999False
67.6Public other new workMV3M1999False
158.1Private industrial new workMV3N1999False
100.4Private commercial new workMV3O1999False
27.1Public new housingMV3J2000False
46.3Private new housingMV3K2000False
43.4Total new housingMVL82000False
53.9Infrastructure new workMV3L2000False
63.8Public other new workMV3M2000False
140.7Private industrial new workMV3N2000False
100.9Private commercial new workMV3O2000False
27.7Public new housingMV3J2001False
43.2Private new housingMV3K2001False
40.9Total new housingMVL82001False
57.7Infrastructure new workMV3L2001False
64.4Public other new workMV3M2001False
143.8Private industrial new workMV3N2001False
100.2Private commercial new workMV3O2001False
31.4Public new housingMV3J2002False
47Private new housingMV3K2002False
44.7Total new housingMVL82002False
65.2Infrastructure new workMV3L2002False
81.4Public other new workMV3M2002False
113.9Private industrial new workMV3N2002False
103.4Private commercial new workMV3O2002False