TidyData: Data Markers & Dropping Columns#
This section shows you how to deal with extracting data markers.
I’m using the term “data marker” as a catch all term for notations or non numeric indicators that can be attached to a cell value in data tables intended for visual consumption to indicate some nuance regarding the value in question.
Source Data#
The data source we’re using for these examples is shown below:
The full data source can be viewed here.
from tidychef import acquire, preview
from tidychef.selection import CsvSelectable
table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/data-markers.csv")
preview(table)
Unnamed Table
A | B | C | |
1 | |||
2 | Trucks | Cars | Name |
3 | |||
4 | 5 | 5 | John Lennon |
5 | 9 | 6* | Ringo Starr |
6 | 7 | 7 | Mick Jagger |
7 | 9 | 6** | Keith Moon |
8 | 9 | Pete Townsend | |
9 | |||
10 | Key | ||
11 | * | Not counting foreign imports | |
12 | ** | Not including white models |
For our example we’re going to:
Take a “Vehicles” column
Take all observations as “raw observations”
Use a horizontal condition to create the “Value” column for observations without data markers.
Use a horizontal condition to create a “Data Marker” column for just the data markers.
“drop” the “raw” column from the tidy data once its served its purpose.
from typing import Dict
from tidychef import acquire, preview
from tidychef.direction import right, down, left, up
from tidychef.output import TidyData, Column
from tidychef.selection import CsvSelectable
table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/data-markers.csv")
anchor = table.re("Name").assert_one().label_as("Anchor Cell")
name = anchor.fill(down).is_not_blank().label_as("Name")
raw_observations = name.fill(left).is_not_blank().label_as("raw")
vehicles = anchor.fill(left).label_as("Vehicles")
preview(anchor, raw_observations, name, vehicles)
def obs_only(line: Dict[str, str]) -> str:
"""
Returns numeric characters from the "raw" column relative to
the column cell being populated.
"""
v = "".join([c for c in line["raw"] if c.isnumeric()])
return v
def data_markers_only(line: Dict[str, str]) -> str:
"""
Returns non numeric characters from the "raw" column relative to
the column cell being populated.
"""
v = "".join([c for c in line["raw"] if not c.isnumeric()])
return v
tidy_data = TidyData(
raw_observations,
Column.horizontal_condition("Value", obs_only),
Column.horizontal_condition("Data Marker", data_markers_only),
Column(name.attach_directly(left)),
Column(vehicles.attach_directly(down)),
drop=["raw"]
)
print(tidy_data)
Anchor Cell |
raw |
Name |
Vehicles |
Unnamed Table
A | B | C | |
1 | |||
2 | Trucks | Cars | Name |
3 | |||
4 | 5 | 5 | John Lennon |
5 | 9 | 6* | Ringo Starr |
6 | 7 | 7 | Mick Jagger |
7 | 9 | 6** | Keith Moon |
8 | 9 | Pete Townsend | |
9 | |||
10 | Key | ||
11 | * | Not counting foreign imports | |
12 | ** | Not including white models |
Value | Data Marker | Name | Vehicles |
5 | John Lennon | Trucks | |
5 | John Lennon | Cars | |
9 | Ringo Starr | Trucks | |
6 | * | Ringo Starr | Cars |
7 | Mick Jagger | Trucks | |
7 | Mick Jagger | Cars | |
9 | Keith Moon | Trucks | |
6 | ** | Keith Moon | Cars |
9 | Pete Townsend | Trucks |