TidyData: Within#
Here we document how to create a within relationship between the values of a column and observations.
The purpose of within is to allow you to extract data where the visual relationship to a single column varies depending on the observation in question.
Source Data#
The data source we’re using for these examples is shown below:
Note - this particular table has some very verbose headers we don’t care about, so we’ll be using |
---|
The full data source can be downloaded here.
from tidychef import acquire, preview
from tidychef.direction import right
from tidychef.selection import XlsSelectable
table: XlsSelectable = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/within-sample.xls", tables="data")
sex = table.excel_ref('D2').expand(right).is_not_blank().label_as("Sex")
preview(sex)
Sex |
data
A | B | C | D | E | F | G | H | I | |
1 | |||||||||
2 | Date | Male | Female | ||||||
3 | 01 07 2023 | 2.0 | 3.0 | 23.0 | 4.0 | 12.0 | 7.0 | ||
4 | 02 07 2023 | 1.0 | 4.0 | 12.0 | 23.0 | 24.0 | 2.0 | ||
5 | 03 07 2023 | 3.0 | 3.0 | 9.0 | 5.0 | 4.0 | 16.0 | ||
6 | 04 07 2023 | 4.0 | 1.0 | 1.0 | 6.0 | 24.0 | 31.0 | ||
7 | 05 07 2023 | 5.0 | 21.0 | 1.0 | 56.0 | 512.0 | 12.0 | ||
8 | 06 07 2023 | 5.0 | 21.0 | 12.0 | 3.0 | 5.0 | 7.0 |
You’ll notice that the directional relationship to both “Male” and “Female” varies between closest:right
, directly:up
and closest:left
depending on the observation in question.
This scenario is what the within lookup is designed to solve.
Within Logic#
An example Column
constructor using within is shown below:
Column(sex.finds_observations_within(below, start=left(1), end=right(1)))
This means that the “sex” column values find observation below itself by scanning below looking from 1 column to the left through to one column to the right.
To put it another way, the column values “look” for observations in the order as indicated by the numbers in parenthesis - (1),(2), (3) etc.
C |
D |
E |
|
---|---|---|---|
1 |
|||
2 |
Male |
||
3 |
(1) |
(2) |
(3) |
4 |
(4) |
(5) |
(6) |
5 |
(7) |
(8) |
(9) |
etc…
You can also change he distance that is scanned as per the below
Column(sex.finds_observations_within(below, start=left(1), end=right(2)))
which would scan for observations as follows
C |
D |
E |
F |
|
---|---|---|---|---|
1 |
||||
2 |
Male |
|||
3 |
(1) |
(2) |
(3) |
(4) |
4 |
(5) |
(6) |
(7) |
(8) |
5 |
(9) |
(10) |
(11) |
(12) |
Now lets apply this to our example
from tidychef import acquire, preview
from tidychef.direction import right, down, left
from tidychef.output import Column, TidyData
from tidychef.selection import XlsSelectable
table: XlsSelectable = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/within-sample.xls", tables="data")
observations = table.excel_ref('C3').expand(right).expand(down).is_not_blank().label_as("Observations")
sex = table.excel_ref('D2').expand(right).is_not_blank().label_as("Sex")
preview(observations, sex)
tidy_data = TidyData(
observations,
Column(sex.attach_within(down, start=left(1), end=right(1)))
)
print(tidy_data)
Observations |
Sex |
data
A | B | C | D | E | F | G | H | I | |
1 | |||||||||
2 | Date | Male | Female | ||||||
3 | 01 07 2023 | 2.0 | 3.0 | 23.0 | 4.0 | 12.0 | 7.0 | ||
4 | 02 07 2023 | 1.0 | 4.0 | 12.0 | 23.0 | 24.0 | 2.0 | ||
5 | 03 07 2023 | 3.0 | 3.0 | 9.0 | 5.0 | 4.0 | 16.0 | ||
6 | 04 07 2023 | 4.0 | 1.0 | 1.0 | 6.0 | 24.0 | 31.0 | ||
7 | 05 07 2023 | 5.0 | 21.0 | 1.0 | 56.0 | 512.0 | 12.0 | ||
8 | 06 07 2023 | 5.0 | 21.0 | 12.0 | 3.0 | 5.0 | 7.0 |
Observations | Sex |
2.0 | Male |
3.0 | Male |
23.0 | Male |
4.0 | Female |
12.0 | Female |
7.0 | Female |
1.0 | Male |
3.0 | Male |
4.0 | Male |
5.0 | Male |
5.0 | Male |
4.0 | Male |
3.0 | Male |
1.0 | Male |
21.0 | Male |
21.0 | Male |
12.0 | Male |
9.0 | Male |
1.0 | Male |
1.0 | Male |
12.0 | Male |
23.0 | Female |
5.0 | Female |
6.0 | Female |
56.0 | Female |
3.0 | Female |
24.0 | Female |
4.0 | Female |
24.0 | Female |
512.0 | Female |
5.0 | Female |
2.0 | Female |
16.0 | Female |
31.0 | Female |
12.0 | Female |
7.0 | Female |
Within: Understanding Direction Of Traversal#
the direction of traversal depends entirely on the start
and end
keyword argument, i.e:
down, start=left(1), end=right(1)
is scanning from left to right, so considers cells in the following order:
C |
D |
E |
|
---|---|---|---|
1 |
|||
2 |
Male |
||
3 |
(1) |
(2) |
(3) |
4 |
(4) |
(5) |
(6) |
5 |
(7) |
(8) |
(9) |
down, start=right(1), end=left(1)
is scanning from right to left, so the order is:
C |
D |
E |
|
---|---|---|---|
1 |
|||
2 |
Male |
||
3 |
(3) |
(2) |
(1) |
4 |
(6) |
(5) |
(4) |
5 |
(9) |
(8) |
(7) |
The order of traversal doesn’t matter in our example scenario here but there are scenarios in more complex data where it will.
Remember - each column value is resolved exactly once per observation, controlling the order of consideration gives you control where more than one column value could potentially be resolved for a given observation.
In simplest terms the order of traversal controls priority.
Consider the following data:
from tidychef import acquire, filters
from tidychef.selection import Selectable
# Note acquire.python.list_of_lists is a constructor that takes a python object
# it's intended purely for testing and/or quickly mocking up tabulated data
table: Selectable = acquire.python.list_of_lists(
[
["Both", "", "Male", "", "", "Female", ""],
["-", "Retired", "Working Age", "Child", "Retired", "Working Age", "Child"],
["13", "3", "1", "4", "2", "1", "2"],
["19", "3", "3", "5", "3", "4", "1"],
["27", "2", "4", "6", "4", "3", "4"],
["", "", "", "", "", "", ""],
]
)
preview(table)
Unnamed Table
A | B | C | D | E | F | G | |
1 | Both | Male | Female | ||||
2 | - | Retired | Working Age | Child | Retired | Working Age | Child |
3 | 13 | 3 | 1 | 4 | 2 | 1 | 2 |
4 | 19 | 3 | 3 | 5 | 3 | 4 | 1 |
5 | 27 | 2 | 4 | 6 | 4 | 3 | 4 |
6 |
Now lets try extracting it with start=left(1), end=right(1)
observations = table.filter(filters.is_numeric).label_as("Observations")
sex = table.excel_ref("1").is_not_blank().label_as("Sex")
preview(observations, sex)
tidy_data = TidyData(
observations,
Column(sex.attach_within(down, start=right(1), end=left(1)))
)
print(tidy_data)
Observations |
Sex |
Unnamed Table
A | B | C | D | E | F | G | |
1 | Both | Male | Female | ||||
2 | - | Retired | Working Age | Child | Retired | Working Age | Child |
3 | 13 | 3 | 1 | 4 | 2 | 1 | 2 |
4 | 19 | 3 | 3 | 5 | 3 | 4 | 1 |
5 | 27 | 2 | 4 | 6 | 4 | 3 | 4 |
6 |
Observations | Sex |
13 | Both |
3 | Both |
1 | Male |
4 | Male |
2 | Female |
1 | Female |
2 | Female |
19 | Both |
3 | Both |
3 | Male |
5 | Male |
3 | Female |
4 | Female |
1 | Female |
27 | Both |
2 | Both |
4 | Male |
6 | Male |
4 | Female |
3 | Female |
4 | Female |
Notice the values for “Both”! These include the leftmost values that should be categorised as Male. |
---|
This is purely a result of direction of travel. The observation cells are considered right->centre->left to resolve the correct column value.
As an example: cell B3 (value “3”) resolves to “Both” as its right of “Both” and left of “Male” but right is the priority, a la right->center->left.
Instead, lets define our within relationship with the opposite direction of travel so the obs are resolved left->center->right compared to the column values.
preview(observations, sex)
tidy_data = TidyData(
observations,
Column(sex.attach_within(down, start=left(1), end=right(1)))
)
print(tidy_data)
Observations |
Sex |
Unnamed Table
A | B | C | D | E | F | G | |
1 | Both | Male | Female | ||||
2 | - | Retired | Working Age | Child | Retired | Working Age | Child |
3 | 13 | 3 | 1 | 4 | 2 | 1 | 2 |
4 | 19 | 3 | 3 | 5 | 3 | 4 | 1 |
5 | 27 | 2 | 4 | 6 | 4 | 3 | 4 |
6 |
Observations | Sex |
13 | Both |
3 | Male |
1 | Male |
4 | Male |
2 | Female |
1 | Female |
2 | Female |
19 | Both |
3 | Male |
3 | Male |
5 | Male |
3 | Female |
4 | Female |
1 | Female |
27 | Both |
2 | Male |
4 | Male |
6 | Male |
4 | Female |
3 | Female |
4 | Female |
And so we get the extraction we wanted.