Against (Creating New Validators)

You will have seen the against module used as part of selection validation and column validation.

The against module is really just a wrapper around some simple validation classes. This document explains what they are, how they work, and how to make your own.

How it works

A Validator in tidychef is any class that inherits from tidychef.against.implementations.base.BaseValidator.

As a convenience the code for this class is shown below (don’t worry, it’s actually very straightforward).

So a validator does exactly two things:

    1. Returns True or False when a Cell is passed to an instance of the class as an argument.

    2. Returns a str message explaining the issue when its .msg() method is called with a Cell as an argument.

Note - The bits around abc and abstract classes are just programmer boilerplate. All they do is tell you if you forget to create one of the two listed methods (a useful thing!), so otherwise you can safely ignore them.
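To illustrate the abc mechanics mentioned in the note, here is a minimal, self-contained sketch of an abstract validator base class. It is not the verbatim tidychef source: the Cell class is a hypothetical stand-in, and IncompleteValidator exists only to show what happens when one of the two methods is forgotten.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for tidychef's Cell, for illustration only."""
    x: int
    y: int
    value: str


class BaseValidator(ABC):
    """Sketch of an abstract validator: subclasses must define both methods."""

    @abstractmethod
    def __call__(self, cell: Cell) -> bool:
        """Return True if the cell passes validation, otherwise False."""

    @abstractmethod
    def msg(self, cell: Cell) -> str:
        """Return a message explaining why the cell failed validation."""


class IncompleteValidator(BaseValidator):
    """Deliberately forgets to implement msg()."""

    def __call__(self, cell: Cell) -> bool:
        return True


try:
    IncompleteValidator()
    instantiated = True
except TypeError as err:
    # abc refuses to instantiate a class with missing abstract methods
    instantiated = False
    print(f"Could not create validator: {err}")
```

The error is raised when the class is instantiated, not when it is defined, so a forgotten method surfaces the first time you try to use the validator.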

Create a new Validator

For this example we’re going to create a validator that confirms that a selected cell holds a value that is considered numerical (you’ll notice this code is the code from the regex validator with a handful of small changes which we’ll list below).

from dataclasses import dataclass

from tidychef.models.source.cell import Cell
from tidychef.against.implementations.base import BaseValidator

@dataclass
class NumericValidator(BaseValidator):
    
    def __call__(self, cell: Cell) -> bool:
        """
        Is the value property of the Cell numeric
        """
        return cell.value.isnumeric()
    
    def msg(self, cell: Cell) -> str:
        """
        Provide a contextually meaningful
        message to the user where cell
        value is not numeric
        """
        return f'"{cell.value}" is not numeric'

Changes we made

  • We’ve removed the pattern variables since our validator doesn’t need arguments.

  • We’ve used the isnumeric method (that all python strings have) to return True or False when the class gets called (via __call__).

  • We’ve updated the message.

  • We’ve updated the docstrings and name of the class.

  • We’ve removed import re since we’re not using it.
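One caveat on the isnumeric choice: Python's str.isnumeric only returns True when every character is a numeric character, so decimals, negative numbers and empty strings all fail. The sketch below compares it with a float()-based check; parses_as_float is a hypothetical helper introduced for illustration, not part of tidychef.

```python
def parses_as_float(value: str) -> bool:
    """Hypothetical helper: True if Python's float() accepts the string."""
    try:
        float(value)
    except ValueError:
        return False
    return True


# Compare the two checks on a handful of cell-like string values
for value in ["1", "1.5", "-3", "", "foo"]:
    print(
        f"{value!r}: isnumeric={value.isnumeric()}, "
        f"parses_as_float={parses_as_float(value)}"
    )
```

Which check is appropriate depends on your data. Note that float() is permissive in its own way: it also accepts strings such as "nan" and "inf", and tolerates surrounding whitespace.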

Now let’s try it:

from tidychef.models.source.cell import Cell

# Create a simple Cell object for testing
# Note the value here is numeric!
valid_cell = Cell(x="0", y="0", value="1")

# Try it out - this will return True ("1" is numeric)
validator = NumericValidator()
print(f'Should be True: {validator(valid_cell)}')

# Now let's try a Cell with a non-numeric value
invalid_cell = Cell(x="0", y="0", value="foo")
print(f'Should be False: {validator(invalid_cell)}')

# Lastly let's call .msg() on our failed validation and print it
print(validator.msg(invalid_cell))

Should be True: True
Should be False: False
"foo" is not numeric

Putting it all together

Now let’s see our new validator in action using the <selectable>.validate() method.

You could equally use this new validator with Columns(validate=) (exactly the same validator class is used in both instances).

Source Data

The data source we’re using for these examples is shown below:

The full data source can be viewed here.

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")
preview(table)

Unnamed Table

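Columns A to K, rows 1 to 8 (blank cells omitted):

Row 1: (blank)
Row 2: Houses, Cars, Boats, Houses, Cars, Boats
Row 3: Beatles, Rolling Stones
Row 4: John, 1, 5, 9, Keith, 2, 6, 10
Row 5: Paul, 2, 6, 10, Mick, 3, 7, 11
Row 6: George, 2, 7, 11, Charlie, 3, 8, 12
Row 7: Ringo, 4, 8, 12, Ronnie, 5, 9, 13
Row 8: (blank)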

For this example we’re going to select all non blank values from columns C & D and run our new validator against them.

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")

# Select them
selection = (table.excel_ref('C') | table.excel_ref('D')).is_not_blank()

# Preview the selection - PRIOR to validation
# note - previewed separately for the sake of this example, typically you'd do it all in one
preview(selection)

# Now validate it
numeric_validator = NumericValidator()
selection.validate(numeric_validator)

Unnamed Selection: 0

Unnamed Table

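Columns A to K, rows 1 to 8 (blank cells omitted):

Row 1: (blank)
Row 2: Houses, Cars, Boats, Houses, Cars, Boats
Row 3: Beatles, Rolling Stones
Row 4: John, 1, 5, 9, Keith, 2, 6, 10
Row 5: Paul, 2, 6, 10, Mick, 3, 7, 11
Row 6: George, 2, 7, 11, Charlie, 3, 8, 12
Row 7: Ringo, 4, 8, 12, Ronnie, 5, 9, 13
Row 8: (blank)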

---------------------------------------------------------------------------
CellValidationError                       Traceback (most recent call last)
Cell In[4], line 15
     13 # Noe validate it
     14 numeric_validator = NumericValidator()
---> 15 selection.validate(numeric_validator)

File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/selection/selectable.py:468, in Selectable.validate(self, validator, raise_first_error)
    465                     validation_errors.append(validator.msg(cell))
    467         if len(validation_errors) > 0:
--> 468             raise CellValidationError(
    469                 f"""
    470 When making selections from table: {self.name} the
    471 following validation errors were encountered:
    472 {linesep.join(validation_errors)}
    473                 """
    474             )
    476         return self

CellValidationError: 
When making selections from table: Unnamed Table the
following validation errors were encountered:
"Houses" is not numeric
"Cars" is not numeric
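
The traceback shows validate() gathering every failure message before raising a single CellValidationError. A rough, self-contained sketch of that collect-then-raise pattern follows. The Cell, CellValidationError and validate names here are stand-ins rather than the tidychef implementations, and the raise_first_error behaviour is an assumption inferred from the parameter name in the traceback.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Cell:
    """Hypothetical stand-in for tidychef's Cell."""
    x: int
    y: int
    value: str


class CellValidationError(Exception):
    """Stand-in for the tidychef exception of the same name."""


class NumericValidator:
    """Same logic as the validator above, minus the tidychef base class."""

    def __call__(self, cell: Cell) -> bool:
        return cell.value.isnumeric()

    def msg(self, cell: Cell) -> str:
        return f'"{cell.value}" is not numeric'


def validate(cells: Iterable[Cell], validator, raise_first_error: bool = False) -> None:
    """Run the validator over every cell, then raise one error listing all failures."""
    errors: List[str] = []
    for cell in cells:
        if not validator(cell):
            if raise_first_error:
                # Assumed behaviour: stop at the first failing cell
                raise CellValidationError(validator.msg(cell))
            errors.append(validator.msg(cell))
    if errors:
        raise CellValidationError("\n".join(errors))


cells = [Cell(2, 1, "Houses"), Cell(2, 3, "1"), Cell(3, 1, "Cars"), Cell(3, 3, "5")]
try:
    validate(cells, NumericValidator())
except CellValidationError as err:
    report = str(err)
    print(report)
```

Collecting all messages first means one run reports every problem cell, rather than forcing you to fix and rerun once per failure.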
                

Further steps

At this point a lesson in tidychef starts to become a lesson in Python (so this is probably more one for the programmers), but consider the following:

  • Validators are simple python classes and can be freely constructed.

  • For any given data pipeline project you will always have a source of truth.

  • Validators can give you contextual information during the transformation process regarding exactly what and where the issue is.

So in terms of possibilities…

  • If your desired output uses codelists - create a validator to compare your extracted values to said codelists.

  • If your desired output has schemas - create a validator to compare your extracted values to said schema.

  • If your desired output ends up on a RESTful API - create a validator to pull the valid values off said API and validate your extracted values against them.

  • etc etc
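
As a sketch of the codelist idea, the validator below takes the allowed values as an argument, which is where the @dataclass decorator earns its keep. To stay runnable on its own it uses a hypothetical stand-in Cell and does not inherit from BaseValidator; in a real project you would import both from tidychef. The CodelistValidator name and the example codes are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass
class Cell:
    """Hypothetical stand-in for tidychef's Cell."""
    x: int
    y: int
    value: str


@dataclass
class CodelistValidator:
    """Illustrative only: a real validator would inherit from BaseValidator."""

    codes: FrozenSet[str]

    def __call__(self, cell: Cell) -> bool:
        # Valid when the cell value is one of the allowed codes
        return cell.value in self.codes

    def msg(self, cell: Cell) -> str:
        return f'"{cell.value}" is not in the codelist: {sorted(self.codes)}'


asset_codes = CodelistValidator(codes=frozenset({"Houses", "Cars", "Boats"}))

print(asset_codes(Cell(2, 1, "Cars")))
print(asset_codes(Cell(2, 1, "Planes")))
print(asset_codes.msg(Cell(2, 1, "Planes")))
```

The same shape works for the schema and API cases: load the valid values once when constructing the validator, then keep the per-cell check cheap.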

You can of course always validate your extracted data after it’s written to disk (a completely sensible thing to do) via <whatever your tool of choice is>, but there is an argument for validating early and often (and while you have context for exactly what in the extraction process has caused the issue).

As always it will depend on your own use case and requirements.