TidyData: Data Markers & Dropping Columns

Contents

TidyData: Data Markers & Dropping Columns#

This section shows you how to deal with extracting data markers.

I’m using the term “data marker” as a catch all term for notations or non numeric indicators that can be attached to a cell value in data tables intended for visual consumption to indicate some nuance regarding the value in question.

Source Data#

The data source we’re using for these examples is shown below:

The full data source can be viewed here.

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/data-markers.csv")
preview(table)

Unnamed Table

ABC
1
2TrucksCarsName
3
455John Lennon
596*Ringo Starr
677Mick Jagger
796**Keith Moon
89Pete Townsend
9
10Key
11*Not counting foreign imports
12**Not including white models

For our example we’re going to:

  • Take a “Vehicles” column

  • Take all observations as “raw observations”

  • Use a horizontal condition to create the “Value” column for observations without data markers.

  • Use a horizontal condition to create a “Data Marker” column for just the data markers.

  • “drop” the “raw” column from the tidy data once its served its purpose.

from typing import Dict

from tidychef import acquire, preview
from tidychef.direction import right, down, left, up
from tidychef.output import TidyData, Column
from tidychef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/data-markers.csv")

anchor = table.re("Name").assert_one().label_as("Anchor Cell")
name = anchor.fill(down).is_not_blank().label_as("Name")
raw_observations = name.fill(left).is_not_blank().label_as("raw")
vehicles = anchor.fill(left).label_as("Vehicles")

preview(anchor, raw_observations, name, vehicles)

def obs_only(line: Dict[str, str]) -> str:
    """
    Returns numeric characters from the "raw" column relative to
    the column cell being populated.
    """
    v = "".join([c for c in line["raw"] if c.isnumeric()])
    return v

def data_markers_only(line: Dict[str, str]) -> str:
    """
    Returns non numeric characters from the "raw" column relative to
    the column cell being populated.
    """
    v = "".join([c for c in line["raw"] if not c.isnumeric()])
    return v

tidy_data = TidyData(
    raw_observations,
    Column.horizontal_condition("Value", obs_only),
    Column.horizontal_condition("Data Marker", data_markers_only),
    Column(name.attach_directly(left)),
    Column(vehicles.attach_directly(down)),
    drop=["raw"]
)
print(tidy_data)
Anchor Cell
raw
Name
Vehicles

Unnamed Table

ABC
1
2TrucksCarsName
3
455John Lennon
596*Ringo Starr
677Mick Jagger
796**Keith Moon
89Pete Townsend
9
10Key
11*Not counting foreign imports
12**Not including white models

ValueData MarkerNameVehicles
5John LennonTrucks
5John LennonCars
9Ringo StarrTrucks
6*Ringo StarrCars
7Mick JaggerTrucks
7Mick JaggerCars
9Keith MoonTrucks
6**Keith MoonCars
9Pete TownsendTrucks