Coronavirus: Loneliness#

An irregular shaped spreadsheet dataset originally published by the UK Office for National Statistics.

Tutorial Structure#

With these example tutorials I’m going to comment heavily and cover nuances in a follow up section (with liberal targetted previews as needed) as it’s the easiest way to grapple with new ideas. It may also be worth opening up these notebooks yourself (they’re in ./jupyterbook in the tidychef github repo) so you can run, alter and generally have a play about with this yourself as part of your learning.

We’ll cover:

  • source data

  • requirements, what we’re aiming to do here

  • show the full script (all logic commented)

  • output the selection preview

  • nuances (where applicable)

  • view the output

This sequencing is necessary as the output for some of the example is really long so that necessitates it coming last. If you’re viewing this via a jupyter book (i.e on the site) you can navigate between the above sections via your right hand menu.

Note - these tutorial scripts might seem verbose due all the comments but that’s ok (this is a tutorial after all). If you take them out you end up with a fairly succinct and human readable encapsulation of what would otherwise (with existing tools) be a rather convoluted and fragile set of instructions to express.

In virtually all cases I’ll make heavy use of preview and bounded to only look at relevant parts of what can be quite large datasets. Downloads links are provided for the source data.

Source#

from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xlsx/coronavirusandlonelinessreferencetable.xlsx", tables="1.4 Loneliness by sex")
preview(table, bounded="A1:F24")

1.4 Loneliness by sex

ABCDEF
1Proportion of people who feel lonely, by sex, Great Britain 3rd April to 3rd May 2020
2
3Definitions: Feeling often/ always lonely Adults aged 16 and over across Great Britain were asked how often they felt lonely, where the options were "often/always" / "some of the time" / "occasionally" / "hardly ever" / "never". Those reporting feeling lonely often / always is used as the indicator. Feeling lonely in the past 7 days Adults aged 16 and over across Great Britain were asked "In the past seven days, how has your well-being being affected?". This indicator is measured by those then reporting "Feeling lonely".
4
5Percentage
6MaleFemaleAll People
7Often/Always5.14.95
8Lower 95% Confidence Interval 3.73.94.1
9Upper 95% Confidence Interval 6.665.9
10
11Sample size (rounded)240028605260
12Percentage
13MaleFemaleAll People
14Felt lonely in past 7 days29.831.830.9
15Lower 95% Confidence Interval 25.828.528.4
16Upper 95% Confidence Interval 33.835.133.5
17
18Sample size (rounded)100014502440
19Source: Opinions and Lifestyle Survey, Office for National Statistics
20
21Notes:
221. All respondents were asked "How often do you feel lonely?". Therefore the base population for those reporting often/ always lonely is those aged 16 and over in Great Britain (n= 5,260)
232. Respondents were asked 'In the past seven days, how has your well-being being affected?' giving the option for noting 'Feeling lonely' if they have already noted that they were 'very worried' or 'somewhat worried' about the 'effect that Coronavirus (COVID-19) is having on their life right now?' and if they reported 'My well-being is being affected (for example, boredom, loneliness, anxiety and stress)' when asked "In which ways is Coronavirus (COVID-19) affecting your life?" (n= 2,440)
24

From an xlsx source which can be downloaded here.

Requirements#

  • We’ll take the priciple description of lonliness (Often/Always etc) as a “Lonliness” column.

  • We’ll take sex as a “Sex” column.

  • Sample size will be taken as a “Sample Size” column.

  • We’ll add a unit of measure so our data makes sense.

One the whole thiis is a particularly good example or a very irregularly presented table, in terms of readibility this was clearly very much constructed for a human not a machine to read.

from tidychef import acquire, preview
from tidychef.direction import right, down, left, up
from tidychef.output import TidyData, Column
from tidychef.selection import XlsxSelectable

# Get the data via the acquire api
table: XlsxSelectable = acquire.xlsx.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xlsx/coronavirusandlonelinessreferencetable.xlsx", tables="1.4 Loneliness by sex")

# Lets find the unwanted footer information so we can dicount it where needed
# - (a) select the cell with the text "Source: " in it
# - (b) confirm there's only one of those
# - (c) expand to a box selection (think "hold clicking a cell and dragging down right" in excel)
unwanted_footer = table.cell_containing_string("Source: ", strict=False).expand_to_box()
table = table - unwanted_footer

# - (a) get the column with Male in it
# - (b) narrow it down to just specifc cells from that column with Male in them
# - (c) extend/extrude the selection to inlcude two cells to the right of each
sex = table.column_containing_strings(["Male"]).re("Male").extrude(right(2)).label_as("Sex")

# - (a) filter the data to cells with the string "Sample size" in them
# - (b) confirm there's just two of those
# - (c) select all cells located to the right of those two
# - (d) desregard any blank cells you just selected
sample_size = table.cells_containing_string("Sample size", strict=False).assert_len(2).fill(right).is_not_blank().label_as("Sample Size")

# For observations, we just want all numeric values that aren't the sample size
observations = table.is_numeric().label_as("Observations") - sample_size

# Loneliness is anything to the left of an observation cell thats not a number
loneliness = observations.fill(left).is_not_numeric()

# Create a bounded preview inline but also write the full preview to path
preview(observations, sex, sample_size, loneliness, bounded="A1:F20")
preview(observations, sex, sample_size, loneliness, path="loneliness.html")

# Construct the tidydata table
tidy_data = TidyData(
    observations, 
    Column(sex.attach_directly(down)),
    Column(sample_size.attach_directly(up)),
    Column(loneliness.attach_directly(right)),
    Column.constant("Unit of measure", "Percentage")
)

tidy_data.to_csv("loneliness.csv")
Observations
Sex
Sample Size
Observations

1.4 Loneliness by sex

ABCDEF
1Proportion of people who feel lonely, by sex, Great Britain 3rd April to 3rd May 2020
2
3Definitions: Feeling often/ always lonely Adults aged 16 and over across Great Britain were asked how often they felt lonely, where the options were "often/always" / "some of the time" / "occasionally" / "hardly ever" / "never". Those reporting feeling lonely often / always is used as the indicator. Feeling lonely in the past 7 days Adults aged 16 and over across Great Britain were asked "In the past seven days, how has your well-being being affected?". This indicator is measured by those then reporting "Feeling lonely".
4
5Percentage
6MaleFemaleAll People
7Often/Always5.14.95
8Lower 95% Confidence Interval 3.73.94.1
9Upper 95% Confidence Interval 6.665.9
10
11Sample size (rounded)240028605260
12Percentage
13MaleFemaleAll People
14Felt lonely in past 7 days29.831.830.9
15Lower 95% Confidence Interval 25.828.528.4
16Upper 95% Confidence Interval 33.835.133.5
17
18Sample size (rounded)100014502440
19Source: Opinions and Lifestyle Survey, Office for National Statistics
20

Nuances#

Subtracting junk text#

There’s a nice example here if why tidychef supports the substraction of one selection from another. Human first tables have a nasty habit of including all sorts of ad hoc footers and side comments that don’t fit into any sort of regular structured data view of a resource.

unwanted_footer = table.filter(filters.contains_string("Source: ")).assert_one().expand_to_box()

which allows us to casually disregard the junk text later on as needed, i.e

loneliness = loneliness - unwanted_footer # clear it up a bit

Sex - why the double ‘Male’ selection?#

This probably seems odd at first. The key point here is re is slow.

So this would work:

sex = table.re("Male").extrude(right(2)).label_as("Sex")

but this is smarter

sex = table.column_containing_strings(["Male"]).re("Male").extrude(right(2)).label_as("Sex")

…in complete honestly, “slow” is relative and unless you’re running re against 10’s of thousands of cells it’s probably fine, even then its minutes not hours -neverthelss- it is a lot slower than the other standard methods so I wanted to make the point. It’s good practice to always aim to tighten your selection prior to re where it’s viable to.

Output#

The full preview can be viewed here.

The tidy data can be downloaded here and a full inline preview of the tidydata generated is shown below.

print(tidy_data)
ObservationsSexSample SizeObservationsUnit of measure
Often/AlwaysMale2400Often/AlwaysPercentage
Often/AlwaysFemale2860Often/AlwaysPercentage
Often/AlwaysAll People5260Often/AlwaysPercentage
Lower 95% Confidence Interval Male2400Lower 95% Confidence Interval Percentage
Lower 95% Confidence Interval Female2860Lower 95% Confidence Interval Percentage
Lower 95% Confidence Interval All People5260Lower 95% Confidence Interval Percentage
Upper 95% Confidence Interval Male2400Upper 95% Confidence Interval Percentage
Upper 95% Confidence Interval Female2860Upper 95% Confidence Interval Percentage
Upper 95% Confidence Interval All People5260Upper 95% Confidence Interval Percentage
Felt lonely in past 7 daysMale1000Felt lonely in past 7 daysPercentage
Felt lonely in past 7 daysFemale1450Felt lonely in past 7 daysPercentage
Felt lonely in past 7 daysAll People2440Felt lonely in past 7 daysPercentage
Lower 95% Confidence Interval Male1000Lower 95% Confidence Interval Percentage
Lower 95% Confidence Interval Female1450Lower 95% Confidence Interval Percentage
Lower 95% Confidence Interval All People2440Lower 95% Confidence Interval Percentage
Upper 95% Confidence Interval Male1000Upper 95% Confidence Interval Percentage
Upper 95% Confidence Interval Female1450Upper 95% Confidence Interval Percentage
Upper 95% Confidence Interval All People2440Upper 95% Confidence Interval Percentage