TidyData: Within#

Here we document how to create a within relationship between the values of a column and observations.

The purpose of within is to allow you to extract data where the visual relationship to a single column varies depending on the observation in question.

Source Data#

The data source we’re using for these examples is shown below:

Note - this particular table has some very verbose headers we don’t care about, so we’ll be using bounded= to remove them from the previews as well as to show just the subset of data we’re working with.

The full data source can be downloaded here.

from tidychef import acquire, preview
from tidychef.direction import right
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/within-sample.xls", tables="data")
sex = table.excel_ref('D2').expand(right).is_not_blank().label_as("Sex")
preview(sex)
Sex

data

ABCDEFGHI
1
2DateMaleFemale
301 07 20232.03.023.04.012.07.0
402 07 20231.04.012.023.024.02.0
503 07 20233.03.09.05.04.016.0
604 07 20234.01.01.06.024.031.0
705 07 20235.021.01.056.0512.012.0
806 07 20235.021.012.03.05.07.0

You’ll notice that the directional relationship to both “Male” and “Female” varies between closest:right, directly:up and closest:left depending on the observation in question.

This scenario is what the within lookup is designed to solve.

Within Logic#

An example Column constructor using within is shown below:

Column(sex.finds_observations_within(below, start=left(1), end=right(1)))

This means that the “sex” column values find observation below itself by scanning below looking from 1 column to the left through to one column to the right.

To put it another way, the column values “look” for observations in the order as indicated by the numbers in parenthesis - (1),(2), (3) etc.

C

D

E

1

2

Male

3

(1)

(2)

(3)

4

(4)

(5)

(6)

5

(7)

(8)

(9)

etc…

You can also change he distance that is scanned as per the below

Column(sex.finds_observations_within(below, start=left(1), end=right(2)))

which would scan for observations as follows

C

D

E

F

1

2

Male

3

(1)

(2)

(3)

(4)

4

(5)

(6)

(7)

(8)

5

(9)

(10)

(11)

(12)

Now lets apply this to our example

from tidychef import acquire, preview
from tidychef.direction import right, down, left
from tidychef.output import Column, TidyData
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/within-sample.xls", tables="data")

observations = table.excel_ref('C3').expand(right).expand(down).is_not_blank().label_as("Observations")
sex = table.excel_ref('D2').expand(right).is_not_blank().label_as("Sex")
preview(observations, sex)

tidy_data = TidyData(
    observations,
    Column(sex.attach_within(down, start=left(1), end=right(1)))
)
print(tidy_data)
Observations
Sex

data

ABCDEFGHI
1
2DateMaleFemale
301 07 20232.03.023.04.012.07.0
402 07 20231.04.012.023.024.02.0
503 07 20233.03.09.05.04.016.0
604 07 20234.01.01.06.024.031.0
705 07 20235.021.01.056.0512.012.0
806 07 20235.021.012.03.05.07.0

ObservationsSex
2.0Male
3.0Male
23.0Male
4.0Female
12.0Female
7.0Female
1.0Male
3.0Male
4.0Male
5.0Male
5.0Male
4.0Male
3.0Male
1.0Male
21.0Male
21.0Male
12.0Male
9.0Male
1.0Male
1.0Male
12.0Male
23.0Female
5.0Female
6.0Female
56.0Female
3.0Female
24.0Female
4.0Female
24.0Female
512.0Female
5.0Female
2.0Female
16.0Female
31.0Female
12.0Female
7.0Female


Within: Understanding Direction Of Traversal#

the direction of traversal depends entirely on the start and end keyword argument, i.e:

down, start=left(1), end=right(1) is scanning from left to right, so considers cells in the following order:

C

D

E

1

2

Male

3

(1)

(2)

(3)

4

(4)

(5)

(6)

5

(7)

(8)

(9)

down, start=right(1), end=left(1) is scanning from right to left, so the order is:

C

D

E

1

2

Male

3

(3)

(2)

(1)

4

(6)

(5)

(4)

5

(9)

(8)

(7)

The order of traversal doesn’t matter in our example scenario here but there are scenarios in more complex data where it will.

Remember - each column value is resolved exactly once per observation, controlling the order of consideration gives you control where more than one column value could potentially be resolved for a given observation.

In simplest terms the order of traversal controls priority.

Consider the following data:

from tidychef import acquire, filters
from tidychef.selection import Selectable

# Note acquire.python.list_of_lists is a constructor that takes a python object
# it's intended purely for testing and/or quickly mocking up tabulated data
table: Selectable = acquire.python.list_of_lists(
    [
        ["Both", "", "Male", "", "", "Female", ""],
        ["-", "Retired", "Working Age", "Child", "Retired", "Working Age", "Child"],
        ["13", "3", "1", "4", "2", "1", "2"],
        ["19", "3", "3", "5", "3", "4", "1"],
        ["27", "2", "4", "6", "4", "3", "4"],
        ["", "", "", "", "", "", ""],
    ]
)

preview(table)

Unnamed Table

ABCDEFG
1BothMaleFemale
2-RetiredWorking AgeChildRetiredWorking AgeChild
313314212
419335341
527246434
6

Now lets try extracting it with start=left(1), end=right(1)

observations = table.filter(filters.is_numeric).label_as("Observations")
sex = table.excel_ref("1").is_not_blank().label_as("Sex")
preview(observations, sex)

tidy_data = TidyData(
    observations,
    Column(sex.attach_within(down, start=right(1), end=left(1)))
)

print(tidy_data)
Observations
Sex

Unnamed Table

ABCDEFG
1BothMaleFemale
2-RetiredWorking AgeChildRetiredWorking AgeChild
313314212
419335341
527246434
6

ObservationsSex
13Both
3Both
1Male
4Male
2Female
1Female
2Female
19Both
3Both
3Male
5Male
3Female
4Female
1Female
27Both
2Both
4Male
6Male
4Female
3Female
4Female


Notice the values for “Both”! These include the leftmost values that should be categorised as Male.

This is purely a result of direction of travel. The observation cells are considered right->centre->left to resolve the correct column value.

As an example: cell B3 (value “3”) resolves to “Both” as its right of “Both” and left of “Male” but right is the priority, a la right->center->left.

Instead, lets define our within relationship with the opposite direction of travel so the obs are resolved left->center->right compared to the column values.

preview(observations, sex)

tidy_data = TidyData(
    observations,
    Column(sex.attach_within(down, start=left(1), end=right(1)))
)

print(tidy_data)
Observations
Sex

Unnamed Table

ABCDEFG
1BothMaleFemale
2-RetiredWorking AgeChildRetiredWorking AgeChild
313314212
419335341
527246434
6

ObservationsSex
13Both
3Male
1Male
4Male
2Female
1Female
2Female
19Both
3Male
3Male
5Male
3Female
4Female
1Female
27Both
2Male
4Male
6Male
4Female
3Female
4Female


And so we get the extraction we wanted.