TidyData: Concatenation#
This section details how you can concatenate multiple tidy data sheets together when you output them.
Source Data#
The data source we’re using for these examples is shown below:
Note - this particular table has some very verbose headers we don’t care about, so we’ll be using |
---|
The full data source can be downloaded here. We’ll be using th 10th tab named “Table 3c”.
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable
table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")
preview(table, bounded="A4:H10")
Table 3c
A | B | C | D | E | F | G | H | |
4 | Percentage change 3 months on previous 3 months | |||||||
5 | Time period | Public new housing | Private new housing | Total new housing | Infrastructure new work | Public other new work | Private industrial new work | Private commercial new work |
6 | Dataset identifier code | MVO6 | MVO7 | MVO8 | MVO9 | MVP2 | MVP3 | MVP4 |
7 | Jun 2010 | 5.6 | 9.8 | 8.8 | 3 | 4.3 | 3.7 | 1.9 |
8 | Jul 2010 | 2 | 5.6 | 4.8 | 0.2 | -0.2 | 9.7 | 3.5 |
9 | Aug 2010 | 5.5 | 4.5 | 4.7 | -2.9 | -2.9 | 24.4 | 5.9 |
10 | Sep 2010 | 11.7 | 7.5 | 8.5 | -6.8 | -3.3 | 16.1 | 5.3 |
The “Two” Sources#
To keep these examples reasonably brief we’re going to create two TidyData class from the above as follow:
one with observations from B7 though H8
one with observations from B9 through H10
from tidychef import acquire, preview
from tidychef.direction import right, down
from tidychef.output import TidyData, Column
from tidychef.selection import XlsxSelectable
table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")
houses = table.excel_ref("B5").expand(right).label_as("Houses")
dataset_identifier_code = houses.shift(down).label_as("Dataset Identifier Code")
period = table.excel_ref("A7").expand(down).label_as("Period")
observations1 = table.excel_ref("B7:H8").label_as("Observations")
observations2 = table.excel_ref("B9:H10").label_as("Observations")
## Preview Both Sets of Selection
print("Preview selections for tidydata1")
preview(observations1, houses, dataset_identifier_code, period, bounded="A4:H10")
print("Preview selections for tidydata1")
preview(observations2, houses, dataset_identifier_code, period, bounded="A4:H10")
Preview selections for tidydata1
Observations |
Houses |
Dataset Identifier Code |
Period |
Table 3c
A | B | C | D | E | F | G | H | |
4 | Percentage change 3 months on previous 3 months | |||||||
5 | Time period | Public new housing | Private new housing | Total new housing | Infrastructure new work | Public other new work | Private industrial new work | Private commercial new work |
6 | Dataset identifier code | MVO6 | MVO7 | MVO8 | MVO9 | MVP2 | MVP3 | MVP4 |
7 | Jun 2010 | 5.6 | 9.8 | 8.8 | 3 | 4.3 | 3.7 | 1.9 |
8 | Jul 2010 | 2 | 5.6 | 4.8 | 0.2 | -0.2 | 9.7 | 3.5 |
9 | Aug 2010 | 5.5 | 4.5 | 4.7 | -2.9 | -2.9 | 24.4 | 5.9 |
10 | Sep 2010 | 11.7 | 7.5 | 8.5 | -6.8 | -3.3 | 16.1 | 5.3 |
Preview selections for tidydata1
Observations |
Houses |
Dataset Identifier Code |
Period |
Table 3c
A | B | C | D | E | F | G | H | |
4 | Percentage change 3 months on previous 3 months | |||||||
5 | Time period | Public new housing | Private new housing | Total new housing | Infrastructure new work | Public other new work | Private industrial new work | Private commercial new work |
6 | Dataset identifier code | MVO6 | MVO7 | MVO8 | MVO9 | MVP2 | MVP3 | MVP4 |
7 | Jun 2010 | 5.6 | 9.8 | 8.8 | 3 | 4.3 | 3.7 | 1.9 |
8 | Jul 2010 | 2 | 5.6 | 4.8 | 0.2 | -0.2 | 9.7 | 3.5 |
9 | Aug 2010 | 5.5 | 4.5 | 4.7 | -2.9 | -2.9 | 24.4 | 5.9 |
10 | Sep 2010 | 11.7 | 7.5 | 8.5 | -6.8 | -3.3 | 16.1 | 5.3 |
Create and preview TidyData objects#
tidydata1 = TidyData(
observations1,
Column(period.attach_directly(right)),
Column(dataset_identifier_code.attach_directly(down)),
Column(houses.attach_directly(down))
)
print("Preview tidydata1")
print(tidydata1)
tidydata2 = TidyData(
observations2,
Column(period.attach_directly(right)),
Column(dataset_identifier_code.attach_directly(down)),
Column(houses.attach_directly(down))
)
print("Preview tidydata2")
print(tidydata2)
Preview tidydata1
Observations | Period | Dataset Identifier Code | Houses |
5.6 | Jun 2010 | MVO6 | Public new housing |
9.8 | Jun 2010 | MVO7 | Private new housing |
8.8 | Jun 2010 | MVO8 | Total new housing |
3 | Jun 2010 | MVO9 | Infrastructure new work |
4.3 | Jun 2010 | MVP2 | Public other new work |
3.7 | Jun 2010 | MVP3 | Private industrial new work |
1.9 | Jun 2010 | MVP4 | Private commercial new work |
2 | Jul 2010 | MVO6 | Public new housing |
5.6 | Jul 2010 | MVO7 | Private new housing |
4.8 | Jul 2010 | MVO8 | Total new housing |
0.2 | Jul 2010 | MVO9 | Infrastructure new work |
-0.2 | Jul 2010 | MVP2 | Public other new work |
9.7 | Jul 2010 | MVP3 | Private industrial new work |
3.5 | Jul 2010 | MVP4 | Private commercial new work |
Preview tidydata2
Observations | Period | Dataset Identifier Code | Houses |
5.5 | Aug 2010 | MVO6 | Public new housing |
4.5 | Aug 2010 | MVO7 | Private new housing |
4.7 | Aug 2010 | MVO8 | Total new housing |
-2.9 | Aug 2010 | MVO9 | Infrastructure new work |
-2.9 | Aug 2010 | MVP2 | Public other new work |
24.4 | Aug 2010 | MVP3 | Private industrial new work |
5.9 | Aug 2010 | MVP4 | Private commercial new work |
11.7 | Sep 2010 | MVO6 | Public new housing |
7.5 | Sep 2010 | MVO7 | Private new housing |
8.5 | Sep 2010 | MVO8 | Total new housing |
-6.8 | Sep 2010 | MVO9 | Infrastructure new work |
-3.3 | Sep 2010 | MVP2 | Public other new work |
16.1 | Sep 2010 | MVP3 | Private industrial new work |
5.3 | Sep 2010 | MVP4 | Private commercial new work |
Concatenation via ‘+’#
The simplest way to concatenate two TidyData objects is just to add them together.
tidy_data_all = tidydata1+tidydata2
print(tidy_data_all)
Observations | Period | Dataset Identifier Code | Houses |
5.6 | Jun 2010 | MVO6 | Public new housing |
9.8 | Jun 2010 | MVO7 | Private new housing |
8.8 | Jun 2010 | MVO8 | Total new housing |
3 | Jun 2010 | MVO9 | Infrastructure new work |
4.3 | Jun 2010 | MVP2 | Public other new work |
3.7 | Jun 2010 | MVP3 | Private industrial new work |
1.9 | Jun 2010 | MVP4 | Private commercial new work |
2 | Jul 2010 | MVO6 | Public new housing |
5.6 | Jul 2010 | MVO7 | Private new housing |
4.8 | Jul 2010 | MVO8 | Total new housing |
0.2 | Jul 2010 | MVO9 | Infrastructure new work |
-0.2 | Jul 2010 | MVP2 | Public other new work |
9.7 | Jul 2010 | MVP3 | Private industrial new work |
3.5 | Jul 2010 | MVP4 | Private commercial new work |
5.5 | Aug 2010 | MVO6 | Public new housing |
4.5 | Aug 2010 | MVO7 | Private new housing |
4.7 | Aug 2010 | MVO8 | Total new housing |
-2.9 | Aug 2010 | MVO9 | Infrastructure new work |
-2.9 | Aug 2010 | MVP2 | Public other new work |
24.4 | Aug 2010 | MVP3 | Private industrial new work |
5.9 | Aug 2010 | MVP4 | Private commercial new work |
11.7 | Sep 2010 | MVO6 | Public new housing |
7.5 | Sep 2010 | MVO7 | Private new housing |
8.5 | Sep 2010 | MVO8 | Total new housing |
-6.8 | Sep 2010 | MVO9 | Infrastructure new work |
-3.3 | Sep 2010 | MVP2 | Public other new work |
16.1 | Sep 2010 | MVP3 | Private industrial new work |
5.3 | Sep 2010 | MVP4 | Private commercial new work |
Concatenation via a list#
The next method is to use the TidyData.from_tidy_list()
constructor.
tidy_data_all = TidyData.from_tidy_list([tidydata1, tidydata2])
print(tidy_data_all)
Observations | Period | Dataset Identifier Code | Houses |
5.6 | Jun 2010 | MVO6 | Public new housing |
9.8 | Jun 2010 | MVO7 | Private new housing |
8.8 | Jun 2010 | MVO8 | Total new housing |
3 | Jun 2010 | MVO9 | Infrastructure new work |
4.3 | Jun 2010 | MVP2 | Public other new work |
3.7 | Jun 2010 | MVP3 | Private industrial new work |
1.9 | Jun 2010 | MVP4 | Private commercial new work |
2 | Jul 2010 | MVO6 | Public new housing |
5.6 | Jul 2010 | MVO7 | Private new housing |
4.8 | Jul 2010 | MVO8 | Total new housing |
0.2 | Jul 2010 | MVO9 | Infrastructure new work |
-0.2 | Jul 2010 | MVP2 | Public other new work |
9.7 | Jul 2010 | MVP3 | Private industrial new work |
3.5 | Jul 2010 | MVP4 | Private commercial new work |
5.5 | Aug 2010 | MVO6 | Public new housing |
4.5 | Aug 2010 | MVO7 | Private new housing |
4.7 | Aug 2010 | MVO8 | Total new housing |
-2.9 | Aug 2010 | MVO9 | Infrastructure new work |
-2.9 | Aug 2010 | MVP2 | Public other new work |
24.4 | Aug 2010 | MVP3 | Private industrial new work |
5.9 | Aug 2010 | MVP4 | Private commercial new work |
11.7 | Sep 2010 | MVO6 | Public new housing |
7.5 | Sep 2010 | MVO7 | Private new housing |
8.5 | Sep 2010 | MVO8 | Total new housing |
-6.8 | Sep 2010 | MVO9 | Infrastructure new work |
-3.3 | Sep 2010 | MVP2 | Public other new work |
16.1 | Sep 2010 | MVP3 | Private industrial new work |
5.3 | Sep 2010 | MVP4 | Private commercial new work |
Concatenation via arguments#
Lastly, you can pass in objects of class TidyData as positional arguments to the TidyData.from_tidy()
constructor.
tidy_data_all = TidyData.from_tidy(tidydata1, tidydata2)
print(tidy_data_all)
Observations | Period | Dataset Identifier Code | Houses |
5.6 | Jun 2010 | MVO6 | Public new housing |
9.8 | Jun 2010 | MVO7 | Private new housing |
8.8 | Jun 2010 | MVO8 | Total new housing |
3 | Jun 2010 | MVO9 | Infrastructure new work |
4.3 | Jun 2010 | MVP2 | Public other new work |
3.7 | Jun 2010 | MVP3 | Private industrial new work |
1.9 | Jun 2010 | MVP4 | Private commercial new work |
2 | Jul 2010 | MVO6 | Public new housing |
5.6 | Jul 2010 | MVO7 | Private new housing |
4.8 | Jul 2010 | MVO8 | Total new housing |
0.2 | Jul 2010 | MVO9 | Infrastructure new work |
-0.2 | Jul 2010 | MVP2 | Public other new work |
9.7 | Jul 2010 | MVP3 | Private industrial new work |
3.5 | Jul 2010 | MVP4 | Private commercial new work |
5.5 | Aug 2010 | MVO6 | Public new housing |
4.5 | Aug 2010 | MVO7 | Private new housing |
4.7 | Aug 2010 | MVO8 | Total new housing |
-2.9 | Aug 2010 | MVO9 | Infrastructure new work |
-2.9 | Aug 2010 | MVP2 | Public other new work |
24.4 | Aug 2010 | MVP3 | Private industrial new work |
5.9 | Aug 2010 | MVP4 | Private commercial new work |
11.7 | Sep 2010 | MVO6 | Public new housing |
7.5 | Sep 2010 | MVO7 | Private new housing |
8.5 | Sep 2010 | MVO8 | Total new housing |
-6.8 | Sep 2010 | MVO9 | Infrastructure new work |
-3.3 | Sep 2010 | MVP2 | Public other new work |
16.1 | Sep 2010 | MVP3 | Private industrial new work |
5.3 | Sep 2010 | MVP4 | Private commercial new work |
Misaligned Headers Error#
Where concatenating together TidyData the column headers must match precisely.
The following is an example of what happens if you try and concatenate two TidyData classes with misaligned headers.
# We're going to redefine tidydata2 but this time we're going to label the observations column
# such that it no longer matches tidydata1
tidydata2 = TidyData(
observations2.label_as("Value"),
Column(period.attach_directly(right)),
Column(dataset_identifier_code.attach_directly(down)),
Column(houses.attach_directly(down))
)
# And we'll preview both so you can see the column headers no longer match
print("Preview tidydata1")
print(tidydata1)
print("Preview tidydata2")
print(tidydata2)
Preview tidydata1
Observations | Period | Dataset Identifier Code | Houses |
5.6 | Jun 2010 | MVO6 | Public new housing |
9.8 | Jun 2010 | MVO7 | Private new housing |
8.8 | Jun 2010 | MVO8 | Total new housing |
3 | Jun 2010 | MVO9 | Infrastructure new work |
4.3 | Jun 2010 | MVP2 | Public other new work |
3.7 | Jun 2010 | MVP3 | Private industrial new work |
1.9 | Jun 2010 | MVP4 | Private commercial new work |
2 | Jul 2010 | MVO6 | Public new housing |
5.6 | Jul 2010 | MVO7 | Private new housing |
4.8 | Jul 2010 | MVO8 | Total new housing |
0.2 | Jul 2010 | MVO9 | Infrastructure new work |
-0.2 | Jul 2010 | MVP2 | Public other new work |
9.7 | Jul 2010 | MVP3 | Private industrial new work |
3.5 | Jul 2010 | MVP4 | Private commercial new work |
Preview tidydata2
Value | Period | Dataset Identifier Code | Houses |
5.5 | Aug 2010 | MVO6 | Public new housing |
4.5 | Aug 2010 | MVO7 | Private new housing |
4.7 | Aug 2010 | MVO8 | Total new housing |
-2.9 | Aug 2010 | MVO9 | Infrastructure new work |
-2.9 | Aug 2010 | MVP2 | Public other new work |
24.4 | Aug 2010 | MVP3 | Private industrial new work |
5.9 | Aug 2010 | MVP4 | Private commercial new work |
11.7 | Sep 2010 | MVO6 | Public new housing |
7.5 | Sep 2010 | MVO7 | Private new housing |
8.5 | Sep 2010 | MVO8 | Total new housing |
-6.8 | Sep 2010 | MVO9 | Infrastructure new work |
-3.3 | Sep 2010 | MVP2 | Public other new work |
16.1 | Sep 2010 | MVP3 | Private industrial new work |
5.3 | Sep 2010 | MVP4 | Private commercial new work |
# Now we try and concatenate them - and get an error!
tidydata1 + tidydata2
---------------------------------------------------------------------------
MisalignedHeadersError Traceback (most recent call last)
Cell In[9], line 3
1 # Now we try and concatenate them - and get an error!
----> 3 tidydata1 + tidydata2
File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/utils/decorators/dontmutate.py:24, in dontmutate.<locals>.wrapper(self, *args, **kwargs)
21 @wraps(method)
22 def wrapper(self, *args, **kwargs):
23 self = copy.deepcopy(self)
---> 24 return method(self, *args, **kwargs)
File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/output/tidydata.py:223, in TidyData.__add__(self, other_tidy_data)
220 # Error if we're joining two TidyData objects
221 # with different headers
222 if self._data[0] != other_tidy_data._data[0]:
--> 223 raise MisalignedHeadersError(
224 f"""
225 You are attempting to sum two tidy data
226 outputs but they do not have the same
227 column headers.
228
229 TidyData1 headers:
230 {self._data[0]}
231
232 TidyData2 headers:
233 {other_tidy_data._data[0]}
234 """
235 )
237 # Since the headers match, join all but the header
238 # row from the new source
239 self._data += other_tidy_data._data[1:]
MisalignedHeadersError:
You are attempting to sum two tidy data
outputs but they do not have the same
column headers.
TidyData1 headers:
[(VIRTUAL CELL, value:"Observations"), (VIRTUAL CELL, value:"Period"), (VIRTUAL CELL, value:"Dataset Identifier Code"), (VIRTUAL CELL, value:"Houses")]
TidyData2 headers:
[(VIRTUAL CELL, value:"Value"), (VIRTUAL CELL, value:"Period"), (VIRTUAL CELL, value:"Dataset Identifier Code"), (VIRTUAL CELL, value:"Houses")]