TidyData: Columns & Direct#

This section gives the first simple examples of how we create TidyData from our selections.

Source Data#

The data source we’re using for these examples is shown below:

Note - this particular table has some very verbose headers we don’t care about, so we’ll be using bounded= to remove them from the previews as well as to show just the subset of data we’re working with.

The full data source can be downloaded here. We’ll be using th 8th tab named “Table 3a”.

from typing import List
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3a")
preview(table, bounded="A4:H10")

Table 3a

ABCDEFGH
4Percentage change period on period
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5JMV5KMV5LMV5M
7Quarterly dataset identifier codeMV54MV55MVM7MV56MV57MV58MV59
8Monthly dataset identifier codeMV4OMV4PMVM2MV4QMV4RMV4SMV4T
91998-191-1.2-2.85.41.98.4
101999-13.3-10-10.3-2.812.63.212

TidyData & Columns#

This example introduces two new classes.

  • TidyData is a class representing tidy data created from the selections.

  • Column represents a single column of data within this TidyData.

Note - for this section we’re going to use only a small sample of the potential observations we could select. This will look a little odd in the previews but is necessary to restrict the size of the TidyData previews to something practical for the context of this documentation.

Example 1: Time Periods#

The simplest example where we just extract the observations and time period.

A new and critical part will be the following:

tidy_data = TidyData(
    observations,
    Column(period.finds_observations_directly(right))
)

Which specifies the visual relationship between the “period” selection of cells and the “observations” selection of cells is… “observations are directly to the right of period”

from tidychef import acquire, preview
from tidychef.direction import right
from tidychef.selection import XlsxSelectable
from tidychef.output import Column, TidyData

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3a")

# Create our selections
# Note, we're not taking all observations as we want to keep the 
# output suitably small for this example.
observations = table.excel_ref("B9:C18").label_as("Value")
period = table.excel_ref("A9:A18").label_as("Period")

# First preview the selections
preview(observations, period, bounded="A4:H18")

# Then create some simple TidyData
tidy_data = TidyData(
    observations,
    Column(period.attach_directly(right))
)

# And view it
print(tidy_data)
Value
Period

Table 3a

ABCDEFGH
4Percentage change period on period
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5JMV5KMV5LMV5M
7Quarterly dataset identifier codeMV54MV55MVM7MV56MV57MV58MV59
8Monthly dataset identifier codeMV4OMV4PMVM2MV4QMV4RMV4SMV4T
91998-191-1.2-2.85.41.98.4
101999-13.3-10-10.3-2.812.63.212
11200025.511.712.9-6.2-5.3-10.70.8
1220012.3-6.7-5.87.10.92.2-0.7
13200213.299.413.126.5-20.73.3
14200313.824.823.6-5.725.55.6-3.6
15200420.121.521.4-12.712.3310.3
162005-5.92.92-4.1-10-2-4.3
17200617.90.31.9-7.8-8.18.48.5
18200715.5-1.60.2-1.4-1.7-2.410.1

ValuePeriod
-191998
11998
-13.31999
-101999
25.52000
11.72000
2.32001
-6.72001
13.22002
92002
13.82003
24.82003
20.12004
21.52004
-5.92005
2.92005
17.92006
0.32006
15.52007
-1.62007


Example 2: Tidy Data#

Now the above example doesn’t accomplish a lot so lets expand our code a little more to add in some more columns.

If you’ve been working the documentation so far you should be able to make sense of the following - there’s nothing new here.

Do take your time and go line by line to make sure you understand this, it’s probably the most key example in this documentation.

from tidychef import acquire, preview
from tidychef.direction import right, down
from tidychef.selection import XlsxSelectable
from tidychef.output import Column, TidyData

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3a")

# Create our selections
observations = table.excel_ref("B9:C18").label_as("Value")
period = table.excel_ref("A9:A18").label_as("Period")
housing = table.excel_ref("B5").expand(right).label_as("Housing")
annual_dataset_code = housing.shift(down).label_as("Annual Dataset Identifier")
quarterly_dataset_code = annual_dataset_code.shift(down).label_as("Quarterly Dataset Identifier")
monthly_dataset_code = quarterly_dataset_code.shift(down).label_as("Monthly Dataset Identifier")

# First preview the selections
preview(observations, housing, annual_dataset_code, quarterly_dataset_code, monthly_dataset_code, period, bounded="A4:H18")

# Then create some simple TidyData
tidy_data = TidyData(
    observations,
    Column(period.attach_directly(right)),
    Column(housing.attach_directly(down)),
    Column(annual_dataset_code.attach_directly(down)),
    Column(quarterly_dataset_code.attach_directly(down)),
    Column(monthly_dataset_code.attach_directly(down))
)

# And view it
print(tidy_data)
Value
Housing
Annual Dataset Identifier
Quarterly Dataset Identifier
Monthly Dataset Identifier
Period

Table 3a

ABCDEFGH
4Percentage change period on period
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Annual dataset identifier codeMV5HMV5IMVM3MV5JMV5KMV5LMV5M
7Quarterly dataset identifier codeMV54MV55MVM7MV56MV57MV58MV59
8Monthly dataset identifier codeMV4OMV4PMVM2MV4QMV4RMV4SMV4T
91998-191-1.2-2.85.41.98.4
101999-13.3-10-10.3-2.812.63.212
11200025.511.712.9-6.2-5.3-10.70.8
1220012.3-6.7-5.87.10.92.2-0.7
13200213.299.413.126.5-20.73.3
14200313.824.823.6-5.725.55.6-3.6
15200420.121.521.4-12.712.3310.3
162005-5.92.92-4.1-10-2-4.3
17200617.90.31.9-7.8-8.18.48.5
18200715.5-1.60.2-1.4-1.7-2.410.1

ValuePeriodHousingAnnual Dataset IdentifierQuarterly Dataset IdentifierMonthly Dataset Identifier
-191998Public new housingMV5HMV54MV4O
11998Private new housingMV5IMV55MV4P
-13.31999Public new housingMV5HMV54MV4O
-101999Private new housingMV5IMV55MV4P
25.52000Public new housingMV5HMV54MV4O
11.72000Private new housingMV5IMV55MV4P
2.32001Public new housingMV5HMV54MV4O
-6.72001Private new housingMV5IMV55MV4P
13.22002Public new housingMV5HMV54MV4O
92002Private new housingMV5IMV55MV4P
13.82003Public new housingMV5HMV54MV4O
24.82003Private new housingMV5IMV55MV4P
20.12004Public new housingMV5HMV54MV4O
21.52004Private new housingMV5IMV55MV4P
-5.92005Public new housingMV5HMV54MV4O
2.92005Private new housingMV5IMV55MV4P
17.92006Public new housingMV5HMV54MV4O
0.32006Private new housingMV5IMV55MV4P
15.52007Public new housingMV5HMV54MV4O
-1.62007Private new housingMV5IMV55MV4P