Iterating Tables/Sheets#
Here we’ll talk about working with iterable data sources - such as extracting and joining data taken from multiple tables from a single spreadheet.
Source Data#
The data source we’re using for these examples is shown below:
Note - this particular table has some very verbose headers we don’t care about, so we’ll be using |
---|
The full data source can be downloaded here.
For this example we’ll be using the following tables:
The 4th table named “Table 1a”.
The 5th table named “Table 1b”.
The principle difference between the tables is 1a is “seasonally adjusted” and 1b is not.
For the sake of practicality we’ll only be extracting observations down to row 12.
from typing import List
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable
tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 1a|Table 1b")
preview(tables[0], bounded="A1:H12")
preview(tables[1], bounded="A1:H12")
Table 1a
A | B | C | D | E | F | G | H | |
1 | Table 1a: Construction output in Great Britain, volume, seasonally adjusted, index numbers, by sector | |||||||
2 | This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance. | |||||||
3 | Source: Construction Output and Employment from the Office for National Statistics | |||||||
4 | 2019=100 | |||||||
5 | Time period | Public new housing | Private new housing | Total new housing | Infrastructure new work | Public other new work | Private industrial new work | Private commercial new work |
6 | Dataset identifier code | MV36 | MV37 | MVL7 | MV38 | MV39 | MV3A | MV3B |
7 | 1997 | 30.8 | 44.8 | 42.6 | 61.2 | 57.6 | 152.1 | 84.3 |
8 | 1998 | 24.9 | 45.3 | 42 | 59.5 | 60.7 | 155 | 91.4 |
9 | 1999 | 21.6 | 40.7 | 37.7 | 57.9 | 68.3 | 159.9 | 102.3 |
10 | 2000 | 27.1 | 45.5 | 42.6 | 54.3 | 64.7 | 142.7 | 103.1 |
11 | 2001 | 27.7 | 42.5 | 40.1 | 58.1 | 65.3 | 145.8 | 102.3 |
12 | 2002 | 31.4 | 46.3 | 43.9 | 65.7 | 82.6 | 115.7 | 105.7 |
Table 1b
A | B | C | D | E | F | G | H | |
1 | Table 1b: Construction output in Great Britain, volume, non-seasonally adjusted, index numbers, by sector | |||||||
2 | This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance. | |||||||
3 | Source: Construction Output and Employment from the Office for National Statistics | |||||||
4 | 2019=100 | |||||||
5 | Time period | Public new housing | Private new housing | Total new housing | Infrastructure new work | Public other new work | Private industrial new work | Private commercial new work |
6 | Dataset identifier code | MV3J | MV3K | MVL8 | MV3L | MV3M | MV3N | MV3O |
7 | 1997 | 30.7 | 45.5 | 43.3 | 60.7 | 56.7 | 149.8 | 82.4 |
8 | 1998 | 24.8 | 46 | 42.8 | 59 | 59.7 | 152.5 | 89.3 |
9 | 1999 | 21.6 | 41.5 | 38.5 | 57.6 | 67.6 | 158.1 | 100.4 |
10 | 2000 | 27.1 | 46.3 | 43.4 | 53.9 | 63.8 | 140.7 | 100.9 |
11 | 2001 | 27.7 | 43.2 | 40.9 | 57.7 | 64.4 | 143.8 | 100.2 |
12 | 2002 | 31.4 | 47 | 44.7 | 65.2 | 81.4 | 113.9 | 103.4 |
An Iterated Extraction#
In this example we’re going to
Iterate through the sheets
Extract data from the two sheets in question - adding a column to indicate whether the data is seasonally adjusted.
Join the data into a single TidyData putput.
from tidychef import acquire, preview
from tidychef.direction import right, down
from tidychef.output import Column, TidyData
from tidychef.selection import XlsxSelectable
tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 1a|Table 1b")
# An empty list to hold our tables
tidy_data_list = []
# Now iterate and extract
for table in tables:
anchor = table.excel_ref("A5").label_as("Anchor Cell")
observations = table.excel_ref("B7:H12").label_as("Observations")
housing = anchor.fill(right).label_as("Housing")
dataset_identifier_codes = housing.shift(down).label_as("Data Identifier Codes")
period = anchor.shift(down(2)).expand(down).label_as("Period")
# We're not gonna set a variable based on the contents of cell A1
# this is what tells us if its SA of NSA
a1_cell_value = table.excel_ref("A1").lone_value()
is_seasonally_adjusted = "False" if "non-season" in a1_cell_value else "True"
# Preview selections to sanity check
# we'll include the anchor cell
preview(anchor, observations, housing, dataset_identifier_codes, period, bounded="A1:H12")
tidy_data = TidyData(
observations,
Column(housing.attach_directly(down)),
Column(dataset_identifier_codes.attach_directly(down)),
Column(period.attach_directly(right)),
Column.constant("Seasonally Adjusted", is_seasonally_adjusted)
)
# Now append the tidy data for this sheet to our list
tidy_data_list.append(tidy_data)
# concatenate the list and print our new output
all_tidy_data = TidyData.from_tidy_list(tidy_data_list)
print(all_tidy_data)
Anchor Cell |
Observations |
Housing |
Data Identifier Codes |
Period |
Table 1a
A | B | C | D | E | F | G | H | |
1 | Table 1a: Construction output in Great Britain, volume, seasonally adjusted, index numbers, by sector | |||||||
2 | This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance. | |||||||
3 | Source: Construction Output and Employment from the Office for National Statistics | |||||||
4 | 2019=100 | |||||||
5 | Time period | Public new housing | Private new housing | Total new housing | Infrastructure new work | Public other new work | Private industrial new work | Private commercial new work |
6 | Dataset identifier code | MV36 | MV37 | MVL7 | MV38 | MV39 | MV3A | MV3B |
7 | 1997 | 30.8 | 44.8 | 42.6 | 61.2 | 57.6 | 152.1 | 84.3 |
8 | 1998 | 24.9 | 45.3 | 42 | 59.5 | 60.7 | 155 | 91.4 |
9 | 1999 | 21.6 | 40.7 | 37.7 | 57.9 | 68.3 | 159.9 | 102.3 |
10 | 2000 | 27.1 | 45.5 | 42.6 | 54.3 | 64.7 | 142.7 | 103.1 |
11 | 2001 | 27.7 | 42.5 | 40.1 | 58.1 | 65.3 | 145.8 | 102.3 |
12 | 2002 | 31.4 | 46.3 | 43.9 | 65.7 | 82.6 | 115.7 | 105.7 |
Anchor Cell |
Observations |
Housing |
Data Identifier Codes |
Period |
Table 1b
A | B | C | D | E | F | G | H | |
1 | Table 1b: Construction output in Great Britain, volume, non-seasonally adjusted, index numbers, by sector | |||||||
2 | This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance. | |||||||
3 | Source: Construction Output and Employment from the Office for National Statistics | |||||||
4 | 2019=100 | |||||||
5 | Time period | Public new housing | Private new housing | Total new housing | Infrastructure new work | Public other new work | Private industrial new work | Private commercial new work |
6 | Dataset identifier code | MV3J | MV3K | MVL8 | MV3L | MV3M | MV3N | MV3O |
7 | 1997 | 30.7 | 45.5 | 43.3 | 60.7 | 56.7 | 149.8 | 82.4 |
8 | 1998 | 24.8 | 46 | 42.8 | 59 | 59.7 | 152.5 | 89.3 |
9 | 1999 | 21.6 | 41.5 | 38.5 | 57.6 | 67.6 | 158.1 | 100.4 |
10 | 2000 | 27.1 | 46.3 | 43.4 | 53.9 | 63.8 | 140.7 | 100.9 |
11 | 2001 | 27.7 | 43.2 | 40.9 | 57.7 | 64.4 | 143.8 | 100.2 |
12 | 2002 | 31.4 | 47 | 44.7 | 65.2 | 81.4 | 113.9 | 103.4 |
Observations | Housing | Data Identifier Codes | Period | Seasonally Adjusted |
30.8 | Public new housing | MV36 | 1997 | True |
44.8 | Private new housing | MV37 | 1997 | True |
42.6 | Total new housing | MVL7 | 1997 | True |
61.2 | Infrastructure new work | MV38 | 1997 | True |
57.6 | Public other new work | MV39 | 1997 | True |
152.1 | Private industrial new work | MV3A | 1997 | True |
84.3 | Private commercial new work | MV3B | 1997 | True |
24.9 | Public new housing | MV36 | 1998 | True |
45.3 | Private new housing | MV37 | 1998 | True |
42 | Total new housing | MVL7 | 1998 | True |
59.5 | Infrastructure new work | MV38 | 1998 | True |
60.7 | Public other new work | MV39 | 1998 | True |
155 | Private industrial new work | MV3A | 1998 | True |
91.4 | Private commercial new work | MV3B | 1998 | True |
21.6 | Public new housing | MV36 | 1999 | True |
40.7 | Private new housing | MV37 | 1999 | True |
37.7 | Total new housing | MVL7 | 1999 | True |
57.9 | Infrastructure new work | MV38 | 1999 | True |
68.3 | Public other new work | MV39 | 1999 | True |
159.9 | Private industrial new work | MV3A | 1999 | True |
102.3 | Private commercial new work | MV3B | 1999 | True |
27.1 | Public new housing | MV36 | 2000 | True |
45.5 | Private new housing | MV37 | 2000 | True |
42.6 | Total new housing | MVL7 | 2000 | True |
54.3 | Infrastructure new work | MV38 | 2000 | True |
64.7 | Public other new work | MV39 | 2000 | True |
142.7 | Private industrial new work | MV3A | 2000 | True |
103.1 | Private commercial new work | MV3B | 2000 | True |
27.7 | Public new housing | MV36 | 2001 | True |
42.5 | Private new housing | MV37 | 2001 | True |
40.1 | Total new housing | MVL7 | 2001 | True |
58.1 | Infrastructure new work | MV38 | 2001 | True |
65.3 | Public other new work | MV39 | 2001 | True |
145.8 | Private industrial new work | MV3A | 2001 | True |
102.3 | Private commercial new work | MV3B | 2001 | True |
31.4 | Public new housing | MV36 | 2002 | True |
46.3 | Private new housing | MV37 | 2002 | True |
43.9 | Total new housing | MVL7 | 2002 | True |
65.7 | Infrastructure new work | MV38 | 2002 | True |
82.6 | Public other new work | MV39 | 2002 | True |
115.7 | Private industrial new work | MV3A | 2002 | True |
105.7 | Private commercial new work | MV3B | 2002 | True |
30.7 | Public new housing | MV3J | 1997 | False |
45.5 | Private new housing | MV3K | 1997 | False |
43.3 | Total new housing | MVL8 | 1997 | False |
60.7 | Infrastructure new work | MV3L | 1997 | False |
56.7 | Public other new work | MV3M | 1997 | False |
149.8 | Private industrial new work | MV3N | 1997 | False |
82.4 | Private commercial new work | MV3O | 1997 | False |
24.8 | Public new housing | MV3J | 1998 | False |
46 | Private new housing | MV3K | 1998 | False |
42.8 | Total new housing | MVL8 | 1998 | False |
59 | Infrastructure new work | MV3L | 1998 | False |
59.7 | Public other new work | MV3M | 1998 | False |
152.5 | Private industrial new work | MV3N | 1998 | False |
89.3 | Private commercial new work | MV3O | 1998 | False |
21.6 | Public new housing | MV3J | 1999 | False |
41.5 | Private new housing | MV3K | 1999 | False |
38.5 | Total new housing | MVL8 | 1999 | False |
57.6 | Infrastructure new work | MV3L | 1999 | False |
67.6 | Public other new work | MV3M | 1999 | False |
158.1 | Private industrial new work | MV3N | 1999 | False |
100.4 | Private commercial new work | MV3O | 1999 | False |
27.1 | Public new housing | MV3J | 2000 | False |
46.3 | Private new housing | MV3K | 2000 | False |
43.4 | Total new housing | MVL8 | 2000 | False |
53.9 | Infrastructure new work | MV3L | 2000 | False |
63.8 | Public other new work | MV3M | 2000 | False |
140.7 | Private industrial new work | MV3N | 2000 | False |
100.9 | Private commercial new work | MV3O | 2000 | False |
27.7 | Public new housing | MV3J | 2001 | False |
43.2 | Private new housing | MV3K | 2001 | False |
40.9 | Total new housing | MVL8 | 2001 | False |
57.7 | Infrastructure new work | MV3L | 2001 | False |
64.4 | Public other new work | MV3M | 2001 | False |
143.8 | Private industrial new work | MV3N | 2001 | False |
100.2 | Private commercial new work | MV3O | 2001 | False |
31.4 | Public new housing | MV3J | 2002 | False |
47 | Private new housing | MV3K | 2002 | False |
44.7 | Total new housing | MVL8 | 2002 | False |
65.2 | Infrastructure new work | MV3L | 2002 | False |
81.4 | Public other new work | MV3M | 2002 | False |
113.9 | Private industrial new work | MV3N | 2002 | False |
103.4 | Private commercial new work | MV3O | 2002 | False |