TidyData: Column: validate=#

The following is an introduction to basic validation in tidychef and usage of the validate= keyword provided by the Column class when constructing TidyData.

Note - this keyword uses exactly the same against module we used when explaining selection.validate() method earlier in this documentation - the difference here is intent (explained below in more detail).

validate= vs .validate()#

The key difference to understand is as follow:

  • The <selectable>.validate() method run against selections so police your extraction logic.

  • Validations via the Column(validate=) keyword runs against the output so police your final product.

Consider the following scenarios:

  • 1.) You want to select an “anchor cell” or a selection of cells for the sole purpose of subtracting it from another selection. It could be important to confirm these selections are accurate but because they’re not directly extracted values then validate= will never see them (just the consequence of them) so the <selectable>.validate() is more appropriate.

  • 2.) You’re are using apply= to cleanse cell value data at the point of extraction and need to make sure the correct things are happening, the <selectable>.validate() method will never see these cleansed values, but validate= will.

There are nuances on where its best to use each but the pithy version is “use both strategies, wherever possible and as much as its practical to do so”.

Source Data#

The data source we’re using for these examples is shown below:

Note - this particular table has some very verbose headers we don’t care about, so we’ll be using bounded= to remove them from the previews as well as to show just the subset of data we’re working with.

The full data source can be downloaded here. We’ll be using th 10th tab named “Table 3c”.

from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")
preview(table, bounded="A4:H10")

Table 3c

ABCDEFGH
4Percentage change 3 months on previous 3 months
5Time periodPublic new housingPrivate new housingTotal new housingInfrastructure new workPublic other new workPrivate industrial new workPrivate commercial new work
6Dataset identifier codeMVO6MVO7MVO8MVO9MVP2MVP3MVP4
7Jun 20105.69.88.834.33.71.9
8Jul 201025.64.80.2-0.29.73.5
9Aug 20105.54.54.7-2.9-2.924.45.9
10Sep 201011.77.58.5-6.8-3.316.15.3

Simple Regex Validation#

For this example we’re going to use the same against module we used early in this documentation. This has a simple regex validator that works in exactly the way explained above.

i.e against is just a convenience, you could just define this yourself.

So the following example is an example of finding an invalid cell with the regex provided:

from tidychef import acquire, against
from tidychef.direction import right, down
from tidychef.output import TidyData, Column
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")

observations = table.excel_ref("B7:H10").label_as("Observations")
dataset_identifier_code = table.excel_ref("B6").expand(right).label_as("Dataset Identifier Codes")

# Note: matches a regex of capital M followed by anything
tidy_data = TidyData(
    observations,
    Column(dataset_identifier_code.attach_directly(down), validate=against.regex("L.*"))
)

print(tidy_data)
---------------------------------------------------------------------------
CellValidationError                       Traceback (most recent call last)
Cell In[2], line 17
     11 # Note: matches a regex of capital M followed by anything
     12 tidy_data = TidyData(
     13     observations,
     14     Column(dataset_identifier_code.attach_directly(down), validate=against.regex("L.*"))
     15 )
---> 17 print(tidy_data)

File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/output/tidydata.py:110, in TidyData.__str__(self)
    109 def __str__(self):  # pragma: no cover
--> 110     return self.__get_representation()

File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/output/tidydata.py:92, in TidyData.__get_representation(self)
     88 """
     89 Representation logic shared by __str__ and __repr__
     90 """
     91 if not self._data:
---> 92     self._transform()
     94 table_as_strings = self._data_as_table_of_strings()
     95 header_row = table_as_strings[0]

File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/output/tidydata.py:359, in TidyData._transform(self)
    353 standard_columns = [
    354     x
    355     for x in self.columns
    356     if not isinstance(x.engine, HorizontalCondition)
    357 ]
    358 for column in standard_columns:
--> 359     column_cell = column.resolve_column_cell_from_obs_cell(observation)
    360     column_value_dict[column.label] = column_cell.value
    361     if column.label not in self.drop:

File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/column/base.py:104, in BaseColumn.resolve_column_cell_from_obs_cell(self, observation_cell, *args)
     91 """
     92 Use the provided lookup engine to return the value
     93 of this Column against a given observation, according
   (...)    101 :return: A single tidychef Cell object.
    102 """
    103 cell = self.engine.resolve(observation_cell, *args)
--> 104 cell = self._post_lookup(cell)
    105 return cell

File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/column/column.py:136, in Column._post_lookup(self, cell)
    134         else:
    135             msg = ""
--> 136         raise CellValidationError(
    137             f"""
    138                 Column {self.label} has a cell that is not valid,
    139                 Invalid cell {cell}.
    140                 {msg}                      
    141                 """
    142         )
    144 return cell

CellValidationError: 
                        Column Dataset Identifier Codes has a cell that is not valid,
                        Invalid cell (B6, value:"MVO6", x:1, y:5).
                        Message is: "MVO6" does not match pattern: "L.*"                      
                        

A Note on Lazy Evaluation#

One thing you many notice about the above is that the validation error does not occur until we try and print the tidy_data variable, this is because the TidyData class uses lazy evaluation.

Simply put, this means the tidy data is never extracted until the last possible moment that is has to be, in this case when we print.