TidyData: Observation Formatting

TidyData: Observation Formatting#

This section deals with how to format the observations column as part of a tidy data extraction.

Some common examples of where you’d want to do this:

  • The observation data has been created with superfluous “.0”s after all integer observations.

  • The observation data has a very high degree of precision beyond what is required or can be ingested by your data systems.

  • The observations have superflous characters or notations you wish to remove (note: how to capture this sort of data marker metadata as its own column is covered in the intermediate section)

Note: As with everything else in this framework this is just a tool. Whether and to what extent you should be using it depends on your own use case and domain knowledge.

Source Data#

The data source we’re using for these examples is shown below:

The full data source can be viewed here.

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")
preview(table)

Unnamed Table

ABCDEFGHIJK
1
2HousesCarsBoatsHousesCarsBoats
3BeatlesRolling Stones
4John159Keith2610
5Paul2610Mick3711
6George2711Charlie3812
7Ringo4812Ronnie5913
8

Explanation#

To format the observations you pass a callable to the TidyData constructor as a obs_apply= keyword argument.

Some simple examples follow using lambda functions (though as with the other callables, you could instead use function or classes with the __call__ dunder method).

Lambda Example 1:#

Replace trailing “.0” values with an empty string.

tidy_data = TidyData(
    observations,
    Column(bands.attach_closest(right)),
    obs_apply=lambda x: x.replace(".0", "")
)

Lambda Example 2:#

Remove the last character is the last character is non numeric.

tidy_data = TidyData(
    observations,
    Column(bands.attach_closest(right)),
    obs_apply=lambda x: x[:-1] if not x[:-1].isnumeric() else x
)

The point is that any callable that takes a str and returns a str can be passed to obs_apply.

Example#

To show this in practice we’re just going to prefix each observation with “foo “.

from tidychef import acquire, preview, filters
from tidychef.direction import right, below
from tidychef.output import TidyData, Column
from tidychef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")

observations = table.filter(filters.is_numeric).label_as("Observations")
bands = (table.excel_ref("A3") | table.excel_ref("G3")).label_as("Bands")
assets = table.excel_ref('2').is_not_blank().label_as("Assets")
members = (table.excel_ref("B") | table.excel_ref("H")).is_not_blank().label_as("Members")
preview(observations, bands, assets, members)

tidy_data = TidyData(
    observations,
    Column(bands.attach_closest(right)),
    Column(assets.attach_directly(below)),
    Column(members.attach_directly(right)),
    obs_apply=lambda x: "foo "+x
)
print(tidy_data)
Observations
Bands
Assets
Members

Unnamed Table

ABCDEFGHIJK
1
2HousesCarsBoatsHousesCarsBoats
3BeatlesRolling Stones
4John159Keith2610
5Paul2610Mick3711
6George2711Charlie3812
7Ringo4812Ronnie5913
8

ObservationsBandsAssetsMembers
foo 1BeatlesHousesJohn
foo 5BeatlesCarsJohn
foo 9BeatlesBoatsJohn
foo 2Rolling StonesHousesKeith
foo 6Rolling StonesCarsKeith
foo 10Rolling StonesBoatsKeith
foo 2BeatlesHousesPaul
foo 6BeatlesCarsPaul
foo 10BeatlesBoatsPaul
foo 3Rolling StonesHousesMick
foo 7Rolling StonesCarsMick
foo 11Rolling StonesBoatsMick
foo 2BeatlesHousesGeorge
foo 7BeatlesCarsGeorge
foo 11BeatlesBoatsGeorge
foo 3Rolling StonesHousesCharlie
foo 8Rolling StonesCarsCharlie
foo 12Rolling StonesBoatsCharlie
foo 4BeatlesHousesRingo
foo 8BeatlesCarsRingo
foo 12BeatlesBoatsRingo
foo 5Rolling StonesHousesRonnie
foo 9Rolling StonesCarsRonnie
foo 13Rolling StonesBoatsRonnie