TidyData: Concatenation#

This section details how you can concatenate multiple tidy data sheets together when you output them.

Source Data#

The data source we’re using for these examples is shown below:

Note - this particular table has some very verbose headers we don’t care about, so we’ll be using bounded= to remove them from the previews as well as to show just the subset of data we’re working with.

The full data source can be downloaded here. We’ll be using th 10th tab named “Table 3c”.

from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")
preview(table, bounded="A4:H10")

Table 3c

ABCDEFGH
4Percentage change 3 months on previous 3 months
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Dataset identifier codeMVO6MVO7MVO8MVO9MVP2MVP3MVP4
7Jun 20105.69.88.834.33.71.9
8Jul 201025.64.80.2-0.29.73.5
9Aug 20105.54.54.7-2.9-2.924.45.9
10Sep 201011.77.58.5-6.8-3.316.15.3

The “Two” Sources#

To keep these examples reasonably brief we’re going to create two TidyData class from the above as follow:

  • one with observations from B7 though H8

  • one with observations from B9 through H10

from tidychef import acquire, preview
from tidychef.direction import right, down
from tidychef.output import TidyData, Column
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")


houses = table.excel_ref("B5").expand(right).label_as("Houses")
dataset_identifier_code = houses.shift(down).label_as("Dataset Identifier Code")
period = table.excel_ref("A7").expand(down).label_as("Period")

observations1 = table.excel_ref("B7:H8").label_as("Observations")
observations2 = table.excel_ref("B9:H10").label_as("Observations")
## Preview Both Sets of Selection

print("Preview selections for tidydata1")
preview(observations1, houses, dataset_identifier_code, period, bounded="A4:H10")

print("Preview selections for tidydata1")
preview(observations2, houses, dataset_identifier_code, period, bounded="A4:H10")
Preview selections for tidydata1
Observations
Houses
Dataset Identifier Code
Period

Table 3c

ABCDEFGH
4Percentage change 3 months on previous 3 months
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Dataset identifier codeMVO6MVO7MVO8MVO9MVP2MVP3MVP4
7Jun 20105.69.88.834.33.71.9
8Jul 201025.64.80.2-0.29.73.5
9Aug 20105.54.54.7-2.9-2.924.45.9
10Sep 201011.77.58.5-6.8-3.316.15.3

Preview selections for tidydata1
Observations
Houses
Dataset Identifier Code
Period

Table 3c

ABCDEFGH
4Percentage change 3 months on previous 3 months
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Dataset identifier codeMVO6MVO7MVO8MVO9MVP2MVP3MVP4
7Jun 20105.69.88.834.33.71.9
8Jul 201025.64.80.2-0.29.73.5
9Aug 20105.54.54.7-2.9-2.924.45.9
10Sep 201011.77.58.5-6.8-3.316.15.3

Create and preview TidyData objects#

tidydata1 = TidyData(
    observations1,
    Column(period.attach_directly(right)),
    Column(dataset_identifier_code.attach_directly(down)),
    Column(houses.attach_directly(down))
)
print("Preview tidydata1")
print(tidydata1)

tidydata2 = TidyData(
    observations2,
    Column(period.attach_directly(right)),
    Column(dataset_identifier_code.attach_directly(down)),
    Column(houses.attach_directly(down))
)
print("Preview tidydata2")
print(tidydata2)
Preview tidydata1
ObservationsPeriodDataset Identifier CodeHouses
5.6Jun 2010MVO6Public new housing
9.8Jun 2010MVO7Private new housing
8.8Jun 2010MVO8Total new housing
3Jun 2010MVO9Infrastructure new work
4.3Jun 2010MVP2Public other new work
3.7Jun 2010MVP3Private industrial new work
1.9Jun 2010MVP4Private commercial new work
2Jul 2010MVO6Public new housing
5.6Jul 2010MVO7Private new housing
4.8Jul 2010MVO8Total new housing
0.2Jul 2010MVO9Infrastructure new work
-0.2Jul 2010MVP2Public other new work
9.7Jul 2010MVP3Private industrial new work
3.5Jul 2010MVP4Private commercial new work

Preview tidydata2
ObservationsPeriodDataset Identifier CodeHouses
5.5Aug 2010MVO6Public new housing
4.5Aug 2010MVO7Private new housing
4.7Aug 2010MVO8Total new housing
-2.9Aug 2010MVO9Infrastructure new work
-2.9Aug 2010MVP2Public other new work
24.4Aug 2010MVP3Private industrial new work
5.9Aug 2010MVP4Private commercial new work
11.7Sep 2010MVO6Public new housing
7.5Sep 2010MVO7Private new housing
8.5Sep 2010MVO8Total new housing
-6.8Sep 2010MVO9Infrastructure new work
-3.3Sep 2010MVP2Public other new work
16.1Sep 2010MVP3Private industrial new work
5.3Sep 2010MVP4Private commercial new work


Concatenation via ‘+’#

The simplest way to concatenate two TidyData objects is just to add them together.

tidy_data_all = tidydata1+tidydata2
print(tidy_data_all)
ObservationsPeriodDataset Identifier CodeHouses
5.6Jun 2010MVO6Public new housing
9.8Jun 2010MVO7Private new housing
8.8Jun 2010MVO8Total new housing
3Jun 2010MVO9Infrastructure new work
4.3Jun 2010MVP2Public other new work
3.7Jun 2010MVP3Private industrial new work
1.9Jun 2010MVP4Private commercial new work
2Jul 2010MVO6Public new housing
5.6Jul 2010MVO7Private new housing
4.8Jul 2010MVO8Total new housing
0.2Jul 2010MVO9Infrastructure new work
-0.2Jul 2010MVP2Public other new work
9.7Jul 2010MVP3Private industrial new work
3.5Jul 2010MVP4Private commercial new work
5.5Aug 2010MVO6Public new housing
4.5Aug 2010MVO7Private new housing
4.7Aug 2010MVO8Total new housing
-2.9Aug 2010MVO9Infrastructure new work
-2.9Aug 2010MVP2Public other new work
24.4Aug 2010MVP3Private industrial new work
5.9Aug 2010MVP4Private commercial new work
11.7Sep 2010MVO6Public new housing
7.5Sep 2010MVO7Private new housing
8.5Sep 2010MVO8Total new housing
-6.8Sep 2010MVO9Infrastructure new work
-3.3Sep 2010MVP2Public other new work
16.1Sep 2010MVP3Private industrial new work
5.3Sep 2010MVP4Private commercial new work


Concatenation via a list#

The next method is to use the TidyData.from_tidy_list() constructor.

tidy_data_all = TidyData.from_tidy_list([tidydata1, tidydata2])
print(tidy_data_all)
ObservationsPeriodDataset Identifier CodeHouses
5.6Jun 2010MVO6Public new housing
9.8Jun 2010MVO7Private new housing
8.8Jun 2010MVO8Total new housing
3Jun 2010MVO9Infrastructure new work
4.3Jun 2010MVP2Public other new work
3.7Jun 2010MVP3Private industrial new work
1.9Jun 2010MVP4Private commercial new work
2Jul 2010MVO6Public new housing
5.6Jul 2010MVO7Private new housing
4.8Jul 2010MVO8Total new housing
0.2Jul 2010MVO9Infrastructure new work
-0.2Jul 2010MVP2Public other new work
9.7Jul 2010MVP3Private industrial new work
3.5Jul 2010MVP4Private commercial new work
5.5Aug 2010MVO6Public new housing
4.5Aug 2010MVO7Private new housing
4.7Aug 2010MVO8Total new housing
-2.9Aug 2010MVO9Infrastructure new work
-2.9Aug 2010MVP2Public other new work
24.4Aug 2010MVP3Private industrial new work
5.9Aug 2010MVP4Private commercial new work
11.7Sep 2010MVO6Public new housing
7.5Sep 2010MVO7Private new housing
8.5Sep 2010MVO8Total new housing
-6.8Sep 2010MVO9Infrastructure new work
-3.3Sep 2010MVP2Public other new work
16.1Sep 2010MVP3Private industrial new work
5.3Sep 2010MVP4Private commercial new work


Concatenation via arguments#

Lastly, you can pass in objects of class TidyData as positional arguments to the TidyData.from_tidy() constructor.

tidy_data_all = TidyData.from_tidy(tidydata1, tidydata2)
print(tidy_data_all)
ObservationsPeriodDataset Identifier CodeHouses
5.6Jun 2010MVO6Public new housing
9.8Jun 2010MVO7Private new housing
8.8Jun 2010MVO8Total new housing
3Jun 2010MVO9Infrastructure new work
4.3Jun 2010MVP2Public other new work
3.7Jun 2010MVP3Private industrial new work
1.9Jun 2010MVP4Private commercial new work
2Jul 2010MVO6Public new housing
5.6Jul 2010MVO7Private new housing
4.8Jul 2010MVO8Total new housing
0.2Jul 2010MVO9Infrastructure new work
-0.2Jul 2010MVP2Public other new work
9.7Jul 2010MVP3Private industrial new work
3.5Jul 2010MVP4Private commercial new work
5.5Aug 2010MVO6Public new housing
4.5Aug 2010MVO7Private new housing
4.7Aug 2010MVO8Total new housing
-2.9Aug 2010MVO9Infrastructure new work
-2.9Aug 2010MVP2Public other new work
24.4Aug 2010MVP3Private industrial new work
5.9Aug 2010MVP4Private commercial new work
11.7Sep 2010MVO6Public new housing
7.5Sep 2010MVO7Private new housing
8.5Sep 2010MVO8Total new housing
-6.8Sep 2010MVO9Infrastructure new work
-3.3Sep 2010MVP2Public other new work
16.1Sep 2010MVP3Private industrial new work
5.3Sep 2010MVP4Private commercial new work


Misaligned Headers Error#

Where concatenating together TidyData the column headers must match precisely.

The following is an example of what happens if you try and concatenate two TidyData classes with misaligned headers.

# We're going to redefine tidydata2 but this time we're going to label the observations column
# such that it no longer matches tidydata1 

tidydata2 = TidyData(
    observations2.label_as("Value"),
    Column(period.attach_directly(right)),
    Column(dataset_identifier_code.attach_directly(down)),
    Column(houses.attach_directly(down))
)

# And we'll preview both so you can see the column headers no longer match
print("Preview tidydata1")
print(tidydata1)

print("Preview tidydata2")
print(tidydata2)
Preview tidydata1
ObservationsPeriodDataset Identifier CodeHouses
5.6Jun 2010MVO6Public new housing
9.8Jun 2010MVO7Private new housing
8.8Jun 2010MVO8Total new housing
3Jun 2010MVO9Infrastructure new work
4.3Jun 2010MVP2Public other new work
3.7Jun 2010MVP3Private industrial new work
1.9Jun 2010MVP4Private commercial new work
2Jul 2010MVO6Public new housing
5.6Jul 2010MVO7Private new housing
4.8Jul 2010MVO8Total new housing
0.2Jul 2010MVO9Infrastructure new work
-0.2Jul 2010MVP2Public other new work
9.7Jul 2010MVP3Private industrial new work
3.5Jul 2010MVP4Private commercial new work

Preview tidydata2
ValuePeriodDataset Identifier CodeHouses
5.5Aug 2010MVO6Public new housing
4.5Aug 2010MVO7Private new housing
4.7Aug 2010MVO8Total new housing
-2.9Aug 2010MVO9Infrastructure new work
-2.9Aug 2010MVP2Public other new work
24.4Aug 2010MVP3Private industrial new work
5.9Aug 2010MVP4Private commercial new work
11.7Sep 2010MVO6Public new housing
7.5Sep 2010MVO7Private new housing
8.5Sep 2010MVO8Total new housing
-6.8Sep 2010MVO9Infrastructure new work
-3.3Sep 2010MVP2Public other new work
16.1Sep 2010MVP3Private industrial new work
5.3Sep 2010MVP4Private commercial new work


# Now we try and concatenate them - and get an error!

tidydata1 + tidydata2
---------------------------------------------------------------------------
MisalignedHeadersError                    Traceback (most recent call last)
Cell In[9], line 3
      1 # Now we try and concatenate them - and get an error!
----> 3 tidydata1 + tidydata2

File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/utils/decorators/dontmutate.py:24, in dontmutate.<locals>.wrapper(self, *args, **kwargs)
     21 @wraps(method)
     22 def wrapper(self, *args, **kwargs):
     23     self = copy.deepcopy(self)
---> 24     return method(self, *args, **kwargs)

File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/output/tidydata.py:223, in TidyData.__add__(self, other_tidy_data)
    220 # Error if we're joining two TidyData objects
    221 # with different headers
    222 if self._data[0] != other_tidy_data._data[0]:
--> 223     raise MisalignedHeadersError(
    224         f"""
    225         You are attempting to sum two tidy data
    226         outputs but they do not have the same
    227         column headers.
    228 
    229         TidyData1 headers:
    230         {self._data[0]}
    231 
    232         TidyData2 headers:
    233         {other_tidy_data._data[0]}
    234     """
    235     )
    237 # Since the headers match, join all but the header
    238 # row from the new source
    239 self._data += other_tidy_data._data[1:]

MisalignedHeadersError: 
                You are attempting to sum two tidy data
                outputs but they do not have the same
                column headers.

                TidyData1 headers:
                [(VIRTUAL CELL, value:"Observations"), (VIRTUAL CELL, value:"Period"), (VIRTUAL CELL, value:"Dataset Identifier Code"), (VIRTUAL CELL, value:"Houses")]

                TidyData2 headers:
                [(VIRTUAL CELL, value:"Value"), (VIRTUAL CELL, value:"Period"), (VIRTUAL CELL, value:"Dataset Identifier Code"), (VIRTUAL CELL, value:"Houses")]