Service Industry#

A small spreadsheet puclished by the UK Office for National Statistcs, makes heavy use of whitespace and inconsistent and fairly irregular spatial relationships between concepts. There’s also some fairly semantically lackluster things scattered throughout, i.e <date> -> (see cell A21).

So while neither particuarly big nor structually complex there are a few processing steps needed to make this source legible as tidy data.

Note - you could go into quite some detail capturing nuances of footnotes against values if needed, for our purposes here I’m keeping in the bounds of extracting the basic data table.

Tutorial Structure#

With these example tutorials I’m going to comment heavily and cover nuances in a follow up section (with liberal targetted previews as needed) as it’s the easiest way to grapple with new ideas. It may also be worth opening up these notebooks yourself (they’re in ./jupyterbook in the tidychef github repo) so you can run, alter and generally have a play about with this yourself as part of your learning.

We’ll cover:

  • source data

  • requirements, what we’re aiming to do here

  • show the full script (all logic commented)

  • output the selection preview

  • nuances (where applicable)

  • view the output

This sequencing is necessary as the output for some of the example is really long so that necessitates it coming last. If you’re viewing this via a jupyter book (i.e on the site) you can navigate between the above sections via your right hand menu.

Note - these tutorial scripts might seem verbose due all the comments but that’s ok (this is a tutorial after all). If you take them out you end up with a fairly succinct and human readable encapsulation of what would otherwise (with existing tools) be a rather convoluted and fragile set of instructions to express.

In virtually all cases I’ll make heavy use of preview and bounded to only look at relevant parts of what can be quite large datasets. Downloads links are provided for the source data.

Source#

For this example we’re extracting the table “TOPS19” as shown below (note - preview cropped for reasons of practicality):

from tidychef import acquire, preview
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/service-industry.xls", tables="TOPSI9")
preview(table, bounded="A1:Q22")

TOPSI9

ABCDEFGHIJKLMNOPQ
1TOPSI9UK Production Turnover
2Turnover in Production and Services Industries
3Current price, not seasonally adjusted£ million
4Back to ContentsManufacture of air and spacecraft and related machinery
5Building of ships and boatsManufacture of other transport equipment
6Manufacture of furnitureOther manufacturing
7
830.130.330.2/4/9 (30OTHER)3132
9JQR4JQS8JQU4JQV2JQV5
10
112012.04787.621632.82162.26722.58784.8
122013.04484.824556.82487.06821.29217.7
132014.04656.223227.92417.07661.39422.9
142015.04774.625425.32293.08011.88830.2
152016.04540.627401.92381.18472.39902.8
16
17
182015.0Q31106.15633.2579.22032.22194.0
19Q41216.97392.4543.32075.82248.9
20
212016 ->Q1964.45697.7603.52014.82488.6
22Q21623.86897.4637.52070.62433.5

From an xlsx source which can be downloaded here.

Requirements#

  • We’ll take the line 4 and 5 headers as “Production”.

  • Time is an add amalgamation of columns A+B so we’ll tidy up and concatentate them.

  • We’ll take row 9 as “CDID” (as I happen to know that’s the name of this particular type of identifier).

  • We’ll take the main observations as “Observations”

  • We’ll clean off the sub table at the bottom as unneccesary for our purpose here.

from typing import List
from tidychef import acquire, against, preview, filters
from tidychef.direction import up, down, left, right
from tidychef.output import Column, TidyData
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/service-industry.xls", tables="TOPSI9")

# We have an extended footer section (almost a separate table) under the main table, we're gonna select it here
# just so we can remove it from the other selections
unwanted = table.excel_ref('A').filter(filters.contains_string("Average")).expand(right).expand(down)

# - (a) Find the row with a cell containing the string "ships"
# - (b) Extrude (expand but only one row) the selection down
# - (c) Extrude up we now have the ships row along with the row above and below it
# - (d) remove any blank cells from the selection
product = table.cell_containing_string("ships",strict=False).extrude(down).extrude(up).expand(right).is_not_blank().label_as("Producton")

# -----------------
# Approach to time:
# We're gonna take time as "year" and "time_qualifier" (quarter, month or blank) and concatentate them
# in the TidyData class, to give us a "Time" column of time_qualifer+year e.g '2019', 'Q1 2019', 'Jul 2019'.
# -----------------

# Get numeric values from column A.Expand down from them (to sweep up the odd -> annotations) and remove
# anything we might have swept up from the footer section
year = table.column('A').is_numeric().expand(down).is_not_blank().label_as("Year") - unwanted

# We'll cover this in nuances
time_qualifier = year.shift(right).expand(down).is_not_blank().label_as("Time Qualifier") - unwanted | year.shift(right)

# Get rows 8-10 then use a regex to find the CDID's (cells with four upper case characters)
# We're taking the rows 8,10 (currently wither side of the row we want) as a bit of future proofing
# against layout changes.
cdid = table.excel_ref('8:10').re(r"^[A-Z]{3}\d$").assert_single_row().label_as("CDID")

# Use a waffle to the observations.
observations = cdid.waffle(down, time_qualifier).is_not_blank().label_as("Observations")

# Create a bounded preview inline but also write the full preview to path
preview(product, year, time_qualifier, cdid, observations)

tidy_data = TidyData(
    observations,
    Column(product.attach_directly(down)),
    Column(year.attach_closest(down), apply=lambda x: x[:4], validate=against.is_numeric), # lets both strip and make sure we've stripped the '->'
    Column(time_qualifier.attach_directly(right)),
    Column(cdid.attach_directly(down)),
    Column.horizontal_condition("Time", lambda x: (x["Time Qualifier"] + " " + x["Year"]).strip()),
    drop=["Year", "Time Qualifier"]
)

tidy_data.to_csv("service-industry.csv")
Producton
Year
Time Qualifier
CDID
Observations

TOPSI9

ABCDEFGHIJKLMNOPQRSTU
1TOPSI9UK Production Turnover
2Turnover in Production and Services Industries
3Current price, not seasonally adjusted£ million
4Back to ContentsManufacture of air and spacecraft and related machinery
5Building of ships and boatsManufacture of other transport equipment
6Manufacture of furnitureOther manufacturing
7
830.130.330.2/4/9 (30OTHER)3132
9JQR4JQS8JQU4JQV2JQV5
10
112012.04787.621632.82162.26722.58784.8
122013.04484.824556.82487.06821.29217.7
132014.04656.223227.92417.07661.39422.9
142015.04774.625425.32293.08011.88830.2
152016.04540.627401.92381.18472.39902.8
16
17
182015.0Q31106.15633.2579.22032.22194.0
19Q41216.97392.4543.32075.82248.9
20
212016 ->Q1964.45697.7603.52014.82488.6
22Q21623.86897.4637.52070.62433.5
23Q3824.36443.6563.52232.62407.6
24Q41128.18363.2576.62154.32573.1
25
262017.0Q11124.46984.0646.42194.92722.0
27Q21510.77904.9706.22081.72406.6
28Q31216.67113.8672.72262.12562.4
29
30
312015.0Jul338.71913.5201.0714.3738.7
32Aug386.11715.4175.7611.8644.9
33Sep381.32004.3202.5706.1810.4
34Oct324.92243.1178.8712.9756.6
35Nov339.62113.0172.9766.4757.9
36Dec552.43036.3191.6596.5734.4
37
382016 ->Jan233.01579.7146.9577.3772.0
39Feb251.31801.9195.0686.3844.6
40Mar480.12316.1261.6751.2872.0
41Apr920.01896.1180.8701.1818.4
42May356.02401.2196.3680.9789.9
43Jun347.82600.1260.4688.6825.2
44
45Jul233.22098.4170.7704.6789.3
46Aug227.02127.5164.4751.3755.3
47Sep364.12217.7228.4776.7863.0
48Oct254.22557.6201.5776.1841.6
49Nov312.92493.2166.8793.7908.1
50Dec561.03312.4208.3584.5823.4
51
522017.0Jan229.11742.7193.0662.8839.8
53Feb319.62344.4202.0718.7869.1
54Mar575.72896.9251.4813.41013.1
55Apr363.02072.1213.8659.9736.7
56May322.62622.0233.8717.2814.1
57Jun825.13210.8258.6704.6855.8
58
59Jul310.62172.8222.4710.4837.3
60Aug470.82281.0206.0788.0823.9
61Sep435.22660.0244.3763.7901.2
62
63
64Average for 3 months up to
65
662016 Sep274.82147.9187.8744.2802.5
67
68Average for 3 months up to
69
702017 Sep405.52371.3224.2754.0854.1
71
72Percentage change1 to
73
742017 Sep47.610.419.41.36.4
75
76† Indicates the earliest period in the series that has been revised.
77-> Marks an increase in data coverage of ourBusiness Register, to include a population of solely PAYE based businesses.
78 Industry group estimates most affected in these tables: 14; 16; 56; 68.3; 69.1; 69.2; 70.2; 74; 78; 81.2; 82; 86, and 96.01-03
791 Percentages are based on values rounded to four decimal places
80
81
82
83
84

Nuances#

Extracting the time qualifer#

I mainly wanted to detail this as it’s a good example of using union during an extraction to handle an awkward scenario. Here’s the line in question:

time_qualifier = year.shift(right).expand(down).is_not_blank().label_as("time2") - unwanted | year.shift(right)

Now we’ll step through what’s actually happening here.

# Lets get the left side of the selection
examp1e1 = year.shift(right).expand(down).is_not_blank().label_as("Time Qualifier - left side of union") - unwanted
preview(examp1e1, bounded="A5:H22")
Time Qualifier - left side of union

TOPSI9

ABCDEFGH
5Building of ships and boats
6
7
830.130.3
9JQR4JQS8
10
112012.04787.621632.8
122013.04484.824556.8
132014.04656.223227.9
142015.04774.625425.3
152016.04540.627401.9
16
17
182015.0Q31106.15633.2
19Q41216.97392.4
20
212016 ->Q1964.45697.7
22Q21623.86897.4

So this gives us some but not all of the selections we need.

Notice we’re missing a selection for B11:B15. All columns must have a cell to lookup, even if that cell is blank (blank is fine "" + year just ends up with year, which is the right result in this scenario). So we need to find a way to add those missing four blank cells to the selection.

Which brings us the right side of the union statement - lets see what it resolves to.

right_side_of_union = year.shift(right).label_as("Time Qualifier - right side of union") 
preview(right_side_of_union, bounded="A5:H22")
Time Qualifier - right side of union

TOPSI9

ABCDEFGH
5Building of ships and boats
6
7
830.130.3
9JQR4JQS8
10
112012.04787.621632.8
122013.04484.824556.8
132014.04656.223227.9
142015.04774.625425.3
152016.04540.627401.9
16
17
182015.0Q31106.15633.2
19Q41216.97392.4
20
212016 ->Q1964.45697.7
22Q21623.86897.4

So this also gives us some but not all of what we need.

So we need both, so (to bring us full circle) we need a union of the cells in these selctions, hence:

time_qualifier = year.shift(right).expand(down).is_not_blank().label_as("Time Qualifier") - unwanted | year.shift(right)
preview(time_qualifier, bounded="A5:H22")
Time Qualifier

TOPSI9

ABCDEFGH
5Building of ships and boats
6
7
830.130.3
9JQR4JQS8
10
112012.04787.621632.8
122013.04484.824556.8
132014.04656.223227.9
142015.04774.625425.3
152016.04540.627401.9
16
17
182015.0Q31106.15633.2
19Q41216.97392.4
20
212016 ->Q1964.45697.7
22Q21623.86897.4

Outputs#

The tidy data can be view here and a full inline preview of the tidydata generated is shown below for those people who’d prefer to scroll.

print(tidy_data)
ObservationsProductonCDIDTime
4787.6Building of ships and boatsJQR42012
21632.8Manufacture of air and spacecraft and related machineryJQS82012
2162.2Manufacture of other transport equipmentJQU42012
6722.5Manufacture of furnitureJQV22012
8784.8Other manufacturingJQV52012
4484.8Building of ships and boatsJQR42013
24556.8Manufacture of air and spacecraft and related machineryJQS82013
2487.0Manufacture of other transport equipmentJQU42013
6821.2Manufacture of furnitureJQV22013
9217.7Other manufacturingJQV52013
4656.2Building of ships and boatsJQR42014
23227.9Manufacture of air and spacecraft and related machineryJQS82014
2417.0Manufacture of other transport equipmentJQU42014
7661.3Manufacture of furnitureJQV22014
9422.9Other manufacturingJQV52014
4774.6Building of ships and boatsJQR42015
25425.3Manufacture of air and spacecraft and related machineryJQS82015
2293.0Manufacture of other transport equipmentJQU42015
8011.8Manufacture of furnitureJQV22015
8830.2Other manufacturingJQV52015
4540.6Building of ships and boatsJQR42016
27401.9Manufacture of air and spacecraft and related machineryJQS82016
2381.1Manufacture of other transport equipmentJQU42016
8472.3Manufacture of furnitureJQV22016
9902.8Other manufacturingJQV52016
1106.1Building of ships and boatsJQR4Q3 2015
5633.2Manufacture of air and spacecraft and related machineryJQS8Q3 2015
579.2Manufacture of other transport equipmentJQU4Q3 2015
2032.2Manufacture of furnitureJQV2Q3 2015
2194.0Other manufacturingJQV5Q3 2015
1216.9Building of ships and boatsJQR4Q4 2015
7392.4Manufacture of air and spacecraft and related machineryJQS8Q4 2015
543.3Manufacture of other transport equipmentJQU4Q4 2015
2075.8Manufacture of furnitureJQV2Q4 2015
2248.9Other manufacturingJQV5Q4 2015
964.4Building of ships and boatsJQR4Q1 2016
5697.7Manufacture of air and spacecraft and related machineryJQS8Q1 2016
603.5Manufacture of other transport equipmentJQU4Q1 2016
2014.8Manufacture of furnitureJQV2Q1 2016
2488.6Other manufacturingJQV5Q1 2016
1623.8Building of ships and boatsJQR4Q2 2016
6897.4Manufacture of air and spacecraft and related machineryJQS8Q2 2016
637.5Manufacture of other transport equipmentJQU4Q2 2016
2070.6Manufacture of furnitureJQV2Q2 2016
2433.5Other manufacturingJQV5Q2 2016
824.3Building of ships and boatsJQR4Q3 2016
6443.6Manufacture of air and spacecraft and related machineryJQS8Q3 2016
563.5Manufacture of other transport equipmentJQU4Q3 2016
2232.6Manufacture of furnitureJQV2Q3 2016
2407.6Other manufacturingJQV5Q3 2016
1128.1Building of ships and boatsJQR4Q4 2016
8363.2Manufacture of air and spacecraft and related machineryJQS8Q4 2016
576.6Manufacture of other transport equipmentJQU4Q4 2016
2154.3Manufacture of furnitureJQV2Q4 2016
2573.1Other manufacturingJQV5Q4 2016
1124.4Building of ships and boatsJQR4Q1 2017
6984.0Manufacture of air and spacecraft and related machineryJQS8Q1 2017
646.4Manufacture of other transport equipmentJQU4Q1 2017
2194.9Manufacture of furnitureJQV2Q1 2017
2722.0Other manufacturingJQV5Q1 2017
1510.7Building of ships and boatsJQR4Q2 2017
7904.9Manufacture of air and spacecraft and related machineryJQS8Q2 2017
706.2Manufacture of other transport equipmentJQU4Q2 2017
2081.7Manufacture of furnitureJQV2Q2 2017
2406.6Other manufacturingJQV5Q2 2017
1216.6Building of ships and boatsJQR4Q3 2017
7113.8Manufacture of air and spacecraft and related machineryJQS8Q3 2017
672.7Manufacture of other transport equipmentJQU4Q3 2017
2262.1Manufacture of furnitureJQV2Q3 2017
2562.4Other manufacturingJQV5Q3 2017
338.7Building of ships and boatsJQR4Jul 2015
1913.5Manufacture of air and spacecraft and related machineryJQS8Jul 2015
201.0Manufacture of other transport equipmentJQU4Jul 2015
714.3Manufacture of furnitureJQV2Jul 2015
738.7Other manufacturingJQV5Jul 2015
386.1Building of ships and boatsJQR4Aug 2015
1715.4Manufacture of air and spacecraft and related machineryJQS8Aug 2015
175.7Manufacture of other transport equipmentJQU4Aug 2015
611.8Manufacture of furnitureJQV2Aug 2015
644.9Other manufacturingJQV5Aug 2015
381.3Building of ships and boatsJQR4Sep 2015
2004.3Manufacture of air and spacecraft and related machineryJQS8Sep 2015
202.5Manufacture of other transport equipmentJQU4Sep 2015
706.1Manufacture of furnitureJQV2Sep 2015
810.4Other manufacturingJQV5Sep 2015
324.9Building of ships and boatsJQR4Oct 2015
2243.1Manufacture of air and spacecraft and related machineryJQS8Oct 2015
178.8Manufacture of other transport equipmentJQU4Oct 2015
712.9Manufacture of furnitureJQV2Oct 2015
756.6Other manufacturingJQV5Oct 2015
339.6Building of ships and boatsJQR4Nov 2015
2113.0Manufacture of air and spacecraft and related machineryJQS8Nov 2015
172.9Manufacture of other transport equipmentJQU4Nov 2015
766.4Manufacture of furnitureJQV2Nov 2015
757.9Other manufacturingJQV5Nov 2015
552.4Building of ships and boatsJQR4Dec 2015
3036.3Manufacture of air and spacecraft and related machineryJQS8Dec 2015
191.6Manufacture of other transport equipmentJQU4Dec 2015
596.5Manufacture of furnitureJQV2Dec 2015
734.4Other manufacturingJQV5Dec 2015
233.0Building of ships and boatsJQR4Jan 2016
1579.7Manufacture of air and spacecraft and related machineryJQS8Jan 2016
146.9Manufacture of other transport equipmentJQU4Jan 2016
577.3Manufacture of furnitureJQV2Jan 2016
772.0Other manufacturingJQV5Jan 2016
251.3Building of ships and boatsJQR4Feb 2016
1801.9Manufacture of air and spacecraft and related machineryJQS8Feb 2016
195.0Manufacture of other transport equipmentJQU4Feb 2016
686.3Manufacture of furnitureJQV2Feb 2016
844.6Other manufacturingJQV5Feb 2016
480.1Building of ships and boatsJQR4Mar 2016
2316.1Manufacture of air and spacecraft and related machineryJQS8Mar 2016
261.6Manufacture of other transport equipmentJQU4Mar 2016
751.2Manufacture of furnitureJQV2Mar 2016
872.0Other manufacturingJQV5Mar 2016
920.0Building of ships and boatsJQR4Apr 2016
1896.1Manufacture of air and spacecraft and related machineryJQS8Apr 2016
180.8Manufacture of other transport equipmentJQU4Apr 2016
701.1Manufacture of furnitureJQV2Apr 2016
818.4Other manufacturingJQV5Apr 2016
356.0Building of ships and boatsJQR4May 2016
2401.2Manufacture of air and spacecraft and related machineryJQS8May 2016
196.3Manufacture of other transport equipmentJQU4May 2016
680.9Manufacture of furnitureJQV2May 2016
789.9Other manufacturingJQV5May 2016
347.8Building of ships and boatsJQR4Jun 2016
2600.1Manufacture of air and spacecraft and related machineryJQS8Jun 2016
260.4Manufacture of other transport equipmentJQU4Jun 2016
688.6Manufacture of furnitureJQV2Jun 2016
825.2Other manufacturingJQV5Jun 2016
233.2Building of ships and boatsJQR4Jul 2016
2098.4Manufacture of air and spacecraft and related machineryJQS8Jul 2016
170.7Manufacture of other transport equipmentJQU4Jul 2016
704.6Manufacture of furnitureJQV2Jul 2016
789.3Other manufacturingJQV5Jul 2016
227.0Building of ships and boatsJQR4Aug 2016
2127.5Manufacture of air and spacecraft and related machineryJQS8Aug 2016
164.4Manufacture of other transport equipmentJQU4Aug 2016
751.3Manufacture of furnitureJQV2Aug 2016
755.3Other manufacturingJQV5Aug 2016
364.1Building of ships and boatsJQR4Sep 2016
2217.7Manufacture of air and spacecraft and related machineryJQS8Sep 2016
228.4Manufacture of other transport equipmentJQU4Sep 2016
776.7Manufacture of furnitureJQV2Sep 2016
863.0Other manufacturingJQV5Sep 2016
254.2Building of ships and boatsJQR4Oct 2016
2557.6Manufacture of air and spacecraft and related machineryJQS8Oct 2016
201.5Manufacture of other transport equipmentJQU4Oct 2016
776.1Manufacture of furnitureJQV2Oct 2016
841.6Other manufacturingJQV5Oct 2016
312.9Building of ships and boatsJQR4Nov 2016
2493.2Manufacture of air and spacecraft and related machineryJQS8Nov 2016
166.8Manufacture of other transport equipmentJQU4Nov 2016
793.7Manufacture of furnitureJQV2Nov 2016
908.1Other manufacturingJQV5Nov 2016
561.0Building of ships and boatsJQR4Dec 2016
3312.4Manufacture of air and spacecraft and related machineryJQS8Dec 2016
208.3Manufacture of other transport equipmentJQU4Dec 2016
584.5Manufacture of furnitureJQV2Dec 2016
823.4Other manufacturingJQV5Dec 2016
229.1Building of ships and boatsJQR4Jan 2017
1742.7Manufacture of air and spacecraft and related machineryJQS8Jan 2017
193.0Manufacture of other transport equipmentJQU4Jan 2017
662.8Manufacture of furnitureJQV2Jan 2017
839.8Other manufacturingJQV5Jan 2017
319.6Building of ships and boatsJQR4Feb 2017
2344.4Manufacture of air and spacecraft and related machineryJQS8Feb 2017
202.0Manufacture of other transport equipmentJQU4Feb 2017
718.7Manufacture of furnitureJQV2Feb 2017
869.1Other manufacturingJQV5Feb 2017
575.7Building of ships and boatsJQR4Mar 2017
2896.9Manufacture of air and spacecraft and related machineryJQS8Mar 2017
251.4Manufacture of other transport equipmentJQU4Mar 2017
813.4Manufacture of furnitureJQV2Mar 2017
1013.1Other manufacturingJQV5Mar 2017
363.0Building of ships and boatsJQR4Apr 2017
2072.1Manufacture of air and spacecraft and related machineryJQS8Apr 2017
213.8Manufacture of other transport equipmentJQU4Apr 2017
659.9Manufacture of furnitureJQV2Apr 2017
736.7Other manufacturingJQV5Apr 2017
322.6Building of ships and boatsJQR4May 2017
2622.0Manufacture of air and spacecraft and related machineryJQS8May 2017
233.8Manufacture of other transport equipmentJQU4May 2017
717.2Manufacture of furnitureJQV2May 2017
814.1Other manufacturingJQV5May 2017
825.1Building of ships and boatsJQR4Jun 2017
3210.8Manufacture of air and spacecraft and related machineryJQS8Jun 2017
258.6Manufacture of other transport equipmentJQU4Jun 2017
704.6Manufacture of furnitureJQV2Jun 2017
855.8Other manufacturingJQV5Jun 2017
310.6Building of ships and boatsJQR4Jul 2017
2172.8Manufacture of air and spacecraft and related machineryJQS8Jul 2017
222.4Manufacture of other transport equipmentJQU4Jul 2017
710.4Manufacture of furnitureJQV2Jul 2017
837.3Other manufacturingJQV5Jul 2017
470.8Building of ships and boatsJQR4Aug 2017
2281.0Manufacture of air and spacecraft and related machineryJQS8Aug 2017
206.0Manufacture of other transport equipmentJQU4Aug 2017
788.0Manufacture of furnitureJQV2Aug 2017
823.9Other manufacturingJQV5Aug 2017
435.2Building of ships and boatsJQR4Sep 2017
2660.0Manufacture of air and spacecraft and related machineryJQS8Sep 2017
244.3Manufacture of other transport equipmentJQU4Sep 2017
763.7Manufacture of furnitureJQV2Sep 2017
901.2Other manufacturingJQV5Sep 2017