TidyData: Within

TidyData: Within#

Here we document how to create a within relationship between the values of a column and observations.

The purpose of within is to allow you to extract data where the visual relationship to a single column varies depending on the observation in question.

Source Data#

The data source we’re using for these examples is shown below:

Note - this particular table has some very verbose headers we don’t care about, so we’ll be using `bounded=` to remove them from the previews as well as to show just the subset of data we’re working with.

The full data source can be downloaded here.

from tidychef import acquire, preview
from tidychef.direction import right
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/within-sample.xls", tables="data")
sex = table.excel_ref('D2').expand(right).is_not_blank().label_as("Sex")
preview(sex)

Sex

data

	A	B	C	D	E	F	G	H	I
1
2	Date			Male				Female
3	01 07 2023		2.0	3.0	23.0		4.0	12.0	7.0
4	02 07 2023		1.0	4.0	12.0		23.0	24.0	2.0
5	03 07 2023		3.0	3.0	9.0		5.0	4.0	16.0
6	04 07 2023		4.0	1.0	1.0		6.0	24.0	31.0
7	05 07 2023		5.0	21.0	1.0		56.0	512.0	12.0
8	06 07 2023		5.0	21.0	12.0		3.0	5.0	7.0

You’ll notice that the directional relationship to both “Male” and “Female” varies between closest:right, directly:up and closest:left depending on the observation in question.

This scenario is what the within lookup is designed to solve.

Within Logic#

An example Column constructor using within is shown below:

Column(sex.finds_observations_within(below, start=left(1), end=right(1)))

This means that the “sex” column values find observation below itself by scanning below looking from 1 column to the left through to one column to the right.

To put it another way, the column values “look” for observations in the order as indicated by the numbers in parenthesis - (1),(2), (3) etc.

	C	D	E
1
2		Male
3	(1)	(2)	(3)
4	(4)	(5)	(6)
5	(7)	(8)	(9)

etc…

You can also change he distance that is scanned as per the below

Column(sex.finds_observations_within(below, start=left(1), end=right(2)))

which would scan for observations as follows

	C	D	E	F
1
2		Male
3	(1)	(2)	(3)	(4)
4	(5)	(6)	(7)	(8)
5	(9)	(10)	(11)	(12)

Now lets apply this to our example

from tidychef import acquire, preview
from tidychef.direction import right, down, left
from tidychef.output import Column, TidyData
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/within-sample.xls", tables="data")

observations = table.excel_ref('C3').expand(right).expand(down).is_not_blank().label_as("Observations")
sex = table.excel_ref('D2').expand(right).is_not_blank().label_as("Sex")
preview(observations, sex)

tidy_data = TidyData(
    observations,
    Column(sex.attach_within(down, start=left(1), end=right(1)))
)
print(tidy_data)

Observations

Sex

data

	A	B	C	D	E	F	G	H	I
1
2	Date			Male				Female
3	01 07 2023		2.0	3.0	23.0		4.0	12.0	7.0
4	02 07 2023		1.0	4.0	12.0		23.0	24.0	2.0
5	03 07 2023		3.0	3.0	9.0		5.0	4.0	16.0
6	04 07 2023		4.0	1.0	1.0		6.0	24.0	31.0
7	05 07 2023		5.0	21.0	1.0		56.0	512.0	12.0
8	06 07 2023		5.0	21.0	12.0		3.0	5.0	7.0

Observations	Sex
2.0	Male
3.0	Male
23.0	Male
4.0	Female
12.0	Female
7.0	Female
1.0	Male
3.0	Male
4.0	Male
5.0	Male
5.0	Male
4.0	Male
3.0	Male
1.0	Male
21.0	Male
21.0	Male
12.0	Male
9.0	Male
1.0	Male
1.0	Male
12.0	Male
23.0	Female
5.0	Female
6.0	Female
56.0	Female
3.0	Female
24.0	Female
4.0	Female
24.0	Female
512.0	Female
5.0	Female
2.0	Female
16.0	Female
31.0	Female
12.0	Female
7.0	Female

Within: Understanding Direction Of Traversal#

the direction of traversal depends entirely on the start and end keyword argument, i.e:

down, start=left(1), end=right(1) is scanning from left to right, so considers cells in the following order:

	C	D	E
1
2		Male
3	(1)	(2)	(3)
4	(4)	(5)	(6)
5	(7)	(8)	(9)

down, start=right(1), end=left(1) is scanning from right to left, so the order is:

	C	D	E
1
2		Male
3	(3)	(2)	(1)
4	(6)	(5)	(4)
5	(9)	(8)	(7)

The order of traversal doesn’t matter in our example scenario here but there are scenarios in more complex data where it will.

Remember - each column value is resolved exactly once per observation, controlling the order of consideration gives you control where more than one column value could potentially be resolved for a given observation.

In simplest terms the order of traversal controls priority.

Consider the following data:

from tidychef import acquire, filters
from tidychef.selection import Selectable

# Note acquire.python.list_of_lists is a constructor that takes a python object
# it's intended purely for testing and/or quickly mocking up tabulated data
table: Selectable = acquire.python.list_of_lists(
    [
        ["Both", "", "Male", "", "", "Female", ""],
        ["-", "Retired", "Working Age", "Child", "Retired", "Working Age", "Child"],
        ["13", "3", "1", "4", "2", "1", "2"],
        ["19", "3", "3", "5", "3", "4", "1"],
        ["27", "2", "4", "6", "4", "3", "4"],
        ["", "", "", "", "", "", ""],
    ]
)

preview(table)

Unnamed Table

	A	B	C	D	E	F	G
1	Both		Male			Female
2	-	Retired	Working Age	Child	Retired	Working Age	Child
3	13	3	1	4	2	1	2
4	19	3	3	5	3	4	1
5	27	2	4	6	4	3	4
6

Now lets try extracting it with start=left(1), end=right(1)

observations = table.filter(filters.is_numeric).label_as("Observations")
sex = table.excel_ref("1").is_not_blank().label_as("Sex")
preview(observations, sex)

tidy_data = TidyData(
    observations,
    Column(sex.attach_within(down, start=right(1), end=left(1)))
)

print(tidy_data)

Observations

Sex

Unnamed Table

	A	B	C	D	E	F	G
1	Both		Male			Female
2	-	Retired	Working Age	Child	Retired	Working Age	Child
3	13	3	1	4	2	1	2
4	19	3	3	5	3	4	1
5	27	2	4	6	4	3	4
6

Observations	Sex
13	Both
3	Both
1	Male
4	Male
2	Female
1	Female
2	Female
19	Both
3	Both
3	Male
5	Male
3	Female
4	Female
1	Female
27	Both
2	Both
4	Male
6	Male
4	Female
3	Female
4	Female

Notice the values for “Both”! These include the leftmost values that should be categorised as Male.

This is purely a result of direction of travel. The observation cells are considered right->centre->left to resolve the correct column value.

As an example: cell B3 (value “3”) resolves to “Both” as its right of “Both” and left of “Male” but right is the priority, a la right->center->left.

Instead, lets define our within relationship with the opposite direction of travel so the obs are resolved left->center->right compared to the column values.

preview(observations, sex)

tidy_data = TidyData(
    observations,
    Column(sex.attach_within(down, start=left(1), end=right(1)))
)

print(tidy_data)

Observations

Sex

Unnamed Table

	A	B	C	D	E	F	G
1	Both		Male			Female
2	-	Retired	Working Age	Child	Retired	Working Age	Child
3	13	3	1	4	2	1	2
4	19	3	3	5	3	4	1
5	27	2	4	6	4	3	4
6

Observations	Sex
13	Both
3	Male
1	Male
4	Male
2	Female
1	Female
2	Female
19	Both
3	Male
3	Male
5	Male
3	Female
4	Female
1	Female
27	Both
2	Male
4	Male
6	Male
4	Female
3	Female
4	Female

And so we get the extraction we wanted.

TidyData: Within

Contents

TidyData: Within#

Source Data#

data

Within Logic#

data

Within: Understanding Direction Of Traversal#

Unnamed Table

Unnamed Table

Unnamed Table