TidyData: Column: validate=#
The following is an introduction to basic validation in tidychef and usage of the validate=
keyword provided by the Column
class when constructing TidyData
.
Note - this keyword uses exactly the same |
---|
validate= vs .validate()#
The key difference to understand is as follow:
The
<selectable>.validate()
method run against selections so police your extraction logic.Validations via the
Column(validate=)
keyword runs against the output so police your final product.
Consider the following scenarios:
1.) You want to select an “anchor cell” or a selection of cells for the sole purpose of subtracting it from another selection. It could be important to confirm these selections are accurate but because they’re not directly extracted values then
validate=
will never see them (just the consequence of them) so the<selectable>.validate()
is more appropriate.2.) You’re are using
apply=
to cleanse cell value data at the point of extraction and need to make sure the correct things are happening, the<selectable>.validate()
method will never see these cleansed values, butvalidate=
will.
There are nuances on where its best to use each but the pithy version is “use both strategies, wherever possible and as much as its practical to do so”.
Source Data#
The data source we’re using for these examples is shown below:
Note - this particular table has some very verbose headers we don’t care about, so we’ll be using |
---|
The full data source can be downloaded here. We’ll be using th 10th tab named “Table 3c”.
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable
table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")
preview(table, bounded="A4:H10")
Table 3c
A | B | C | D | E | F | G | H | |
4 | Percentage change 3 months on previous 3 months | |||||||
5 | Time period | Public new housing | Private new housing | Total new housing | Infrastructure new work | Public other new work | Private industrial new work | Private commercial new work |
6 | Dataset identifier code | MVO6 | MVO7 | MVO8 | MVO9 | MVP2 | MVP3 | MVP4 |
7 | Jun 2010 | 5.6 | 9.8 | 8.8 | 3 | 4.3 | 3.7 | 1.9 |
8 | Jul 2010 | 2 | 5.6 | 4.8 | 0.2 | -0.2 | 9.7 | 3.5 |
9 | Aug 2010 | 5.5 | 4.5 | 4.7 | -2.9 | -2.9 | 24.4 | 5.9 |
10 | Sep 2010 | 11.7 | 7.5 | 8.5 | -6.8 | -3.3 | 16.1 | 5.3 |
Simple Regex Validation#
For this example we’re going to use the same against
module we used early in this documentation. This has a simple regex validator that works in exactly the way explained above.
i.e against
is just a convenience, you could just define this yourself.
So the following example is an example of finding an invalid cell with the regex provided:
from tidychef import acquire, against
from tidychef.direction import right, down
from tidychef.output import TidyData, Column
from tidychef.selection import XlsxSelectable
table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")
observations = table.excel_ref("B7:H10").label_as("Observations")
dataset_identifier_code = table.excel_ref("B6").expand(right).label_as("Dataset Identifier Codes")
# Note: matches a regex of capital M followed by anything
tidy_data = TidyData(
observations,
Column(dataset_identifier_code.attach_directly(down), validate=against.regex("L.*"))
)
print(tidy_data)
---------------------------------------------------------------------------
CellValidationError Traceback (most recent call last)
Cell In[2], line 17
11 # Note: matches a regex of capital M followed by anything
12 tidy_data = TidyData(
13 observations,
14 Column(dataset_identifier_code.attach_directly(down), validate=against.regex("L.*"))
15 )
---> 17 print(tidy_data)
File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/output/tidydata.py:110, in TidyData.__str__(self)
109 def __str__(self): # pragma: no cover
--> 110 return self.__get_representation()
File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/output/tidydata.py:92, in TidyData.__get_representation(self)
88 """
89 Representation logic shared by __str__ and __repr__
90 """
91 if not self._data:
---> 92 self._transform()
94 table_as_strings = self._data_as_table_of_strings()
95 header_row = table_as_strings[0]
File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/output/tidydata.py:359, in TidyData._transform(self)
353 standard_columns = [
354 x
355 for x in self.columns
356 if not isinstance(x.engine, HorizontalCondition)
357 ]
358 for column in standard_columns:
--> 359 column_cell = column.resolve_column_cell_from_obs_cell(observation)
360 column_value_dict[column.label] = column_cell.value
361 if column.label not in self.drop:
File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/column/base.py:104, in BaseColumn.resolve_column_cell_from_obs_cell(self, observation_cell, *args)
91 """
92 Use the provided lookup engine to return the value
93 of this Column against a given observation, according
(...) 101 :return: A single tidychef Cell object.
102 """
103 cell = self.engine.resolve(observation_cell, *args)
--> 104 cell = self._post_lookup(cell)
105 return cell
File ~/.pyenv/versions/3.12.11/lib/python3.12/site-packages/tidychef/column/column.py:136, in Column._post_lookup(self, cell)
134 else:
135 msg = ""
--> 136 raise CellValidationError(
137 f"""
138 Column {self.label} has a cell that is not valid,
139 Invalid cell {cell}.
140 {msg}
141 """
142 )
144 return cell
CellValidationError:
Column Dataset Identifier Codes has a cell that is not valid,
Invalid cell (B6, value:"MVO6", x:1, y:5).
Message is: "MVO6" does not match pattern: "L.*"
A Note on Lazy Evaluation#
One thing you many notice about the above is that the validation error does not occur until we try and print the tidy_data
variable, this is because the TidyData
class uses lazy evaluation.
Simply put, this means the tidy data is never extracted until the last possible moment that is has to be, in this case when we print.