Best Practice#
This section covers some best practice approaches and techniques for creating robust transform recipes with tidychef.
Conditional Is Better Than Explicit#
If you want to create a robust, repeatable recipe then conditional selections (particularly simple conditions such as .expand() and .fill()) are typically better than explicit .excel_ref() statements.
This is because many tabulated sources are intended to be extended over time.
Consider the following:
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

# Acquire "Table 3c" from the source xlsx, then preview the last few rows
table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")
preview(table, bounded="A155:H161")
Table 3c
    |    A     |   B  |   C  |   D  |   E  |   F  |   G  |   H
155 | Oct 2022 | -3.3 |  0.6 |  0.2 |  5.3 | -2.2 |  0.8 |  1.5
156 | Nov 2022 |  2   | -0.4 | -0.2 |  6.2 |  4   | -0.5 |  0.5
157 | Dec 2022 |  7   | -2.1 | -1.1 |  7.3 | 10   | -1.8 | -1.7
158 | Jan 2023 |  7.1 | -5.3 | -4   |  6.3 |  8.9 |  0.3 | -1.1
159 | Feb 2023 |  5   | -5.1 | -4   |  2.9 |  4.4 |  0.8 |  0.3
160 | Mar 2023 |  1.3 | -5.3 | -4.5 | -1.9 |  0.2 |  4.2 |  0.7
161 | Apr 2023 | -0.8 | -3.3 | -3   | -1.5 |  2   |  5.4 |  0.9
It’s pretty easy to see that the creator of this dataset will be adding more months of data in the future - and you want your script to capture this additional data on reruns.
Now consider our two different techniques of selection.
excel_ref()#
A statement of something like table.excel_ref("B155:B161") will work, but when the source is inevitably republished with additional rows, that additional data will not be extracted when the recipe is rerun.
expand()#
By contrast, a statement of table.excel_ref("B155").expand(down).expand(right) would automatically extract the additional data upon a rerun of the script, with no code update required.
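To make that concrete, here is a minimal sketch of the conditional approach against the table previewed above. It assumes the direction objects are imported from tidychef.direction (as used elsewhere in this Jupyterbook); everything else mirrors the earlier snippet.

from tidychef import acquire, preview
from tidychef.direction import down, right
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")

# Anchor on the first observation cell, then expand down and right.
# Rows the publisher appends in future republications are picked up
# automatically on a rerun, with no change to this code.
observations = table.excel_ref("B155").expand(down).expand(right)
preview(observations, bounded="A155:H161")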
When to Use Validation#
If you’re doing a one-off transform you’ve no intention of repeating (you’re quickly pulling some data apart to populate a dataframe, chart etc.) then you could (possibly) argue that validation is overkill.
If you’re intending to run your code more than once against a changeable data source, you should be implementing some form of validation, even if it’s just one of the lightweight default methods provided by tidychef that are documented elsewhere in this Jupyterbook.
If you’re setting up any kind of ETL pipeline, you should almost certainly be investing time in a quality validation approach suitable for your pipeline processes and use case(s).
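As a flavour of the lightweight option, the sketch below validates a month column as it is selected. The against validators and the .validate() call follow the validation pages elsewhere in this Jupyterbook, but treat the exact names and the regex as assumptions rather than a definitive reference.

from tidychef import acquire, against
from tidychef.direction import down
from tidychef.selection import XlsxSelectable

table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Table 3c")

# Assumption: every month label should look like "Oct 2022".
# A cell that fails (a typo, a stray footnote marker) is flagged
# loudly rather than silently shaping the output.
months = table.excel_ref("A155").expand(down).is_not_blank()
months.validate(against.regex(r"[A-Z][a-z]{2} \d{4}"))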
Select Wide and Validate Narrow#
Sharp observers will have noticed that you can use regex to select cells, and also use regex to validate both selections and extracted column values - isn’t this redundant?
No, because selection and validation have very different goals. As a general rule:
You want to keep your selection techniques wide, so that cells get selected even if they contain invalid values (perhaps especially if they contain invalid values) - provided, of course, those cells are located structurally where valid values should be.
Your validation then needs to be narrow, precise and unforgiving.
You can sum this up with the following statement:
You cannot stop human beings making human errors, but you CAN set things up so you know WHEN and precisely WHERE any such error is encountered.
Consider this: if you use a strict regular expression to select only valid values, then you completely skip the invalid values.
What if it’s just a typo? “Maale” in place of “Male”? Do you want “Male” to be missing from your output?
What if you use an is_numeric filter but a user adds a data marker of * in place of an observation?
My point is that there are nuances here: you don’t want to process invalid values, but you do (on balance) want to select them (if they’re located where valid values should be).
The trick is to use the provided tools to make it obvious where the problem is and precisely what it is, so it can be trivially addressed. Judicious and well-targeted use of validation will get you that.
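To close with a worked illustration of the pattern, the sketch below selects wide (every non-blank cell where sex labels should sit) and validates narrow (only “Male” or “Female” passes). The file name, cell reference and against.items validator are illustrative assumptions, not a prescription.

from tidychef import acquire, against
from tidychef.direction import down
from tidychef.selection import CsvSelectable

# Hypothetical source with a column of sex labels starting at C5.
table: CsvSelectable = acquire.csv.local("my_source.csv")

# Select WIDE: capture every populated cell in the column, so a typo
# like "Maale" is still selected rather than silently skipped.
sex = table.excel_ref("C5").expand(down).is_not_blank().label_as("Sex")

# Validate NARROW: anything other than exactly "Male" or "Female" is
# reported, telling you when and precisely where the error occurred.
sex.validate(against.items(["Male", "Female"]))

The wide selection guarantees the typo reaches the validator; the narrow validation guarantees it cannot pass unnoticed.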