Advanced Acquisition#
This section explains some of the more advanced things you can do with the acquire module.
pre_hook=#
A pre hook is an optional keyword argument of type callable that you can provide to any acquire method. The callable:
does something with the argument you pass as source
passes back a value that “acquire” will use as the new source.
The code is here and is not complicated (it’s literally two lines).
It is however a very useful two lines, examples follow:
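As a rough mental model before the worked example, the behaviour described above can be sketched like this. Note that `acquire_sketch` and `add_csv_suffix` are hypothetical names invented for illustration; this is not tidychef's actual implementation, only a sketch of the pattern under the assumption that the hook is applied to the source before anything is loaded.

```python
from typing import Callable, Optional


def acquire_sketch(source: str, pre_hook: Optional[Callable[[str], str]] = None) -> str:
    """Hypothetical stand-in for an acquire method; not tidychef's real code."""
    if pre_hook is not None:
        # The hook receives the original source and returns the new one.
        source = pre_hook(source)
    # A real acquire method would now load data from `source`.
    return source


def add_csv_suffix(source: str) -> str:
    """Toy pre hook: derive a new source from the one passed in."""
    return source + ".csv"


print(acquire_sketch("data/bands"))                           # data/bands
print(acquire_sketch("data/bands", pre_hook=add_csv_suffix))  # data/bands.csv
```

The point of the design is that the caller never needs to know the final source up front; the hook works it out at acquisition time.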
Example Scenario#
For our example scenario we’re going to use a url to a json api (a tiny one we’ve made ourselves) as a source.
You can view the contents of the api here.
Our requirements are as follows:
We need to get the url of our source from that api.
The structure of the json api will not change.
The urls listed will change.
We want to always acquire the url with “bands-wide” in the path. There will always be one, but the number of other urls and its position within the list will change.
We want an exception if there’s anything other than exactly one “bands-wide” url.
We want that “bands-wide” url to be passed along to the acquire function and used to create our selectable.
Example follows:
import requests
from tidychef import preview

# First we'll create a simple pre hook that meets our requirements.
def select_band_csv(source: str) -> str:
    """
    pre hook function to get the correct url from a list of urls
    provided by a json api.
    """
    r = requests.get(source)
    url_dict = r.json()
    url_wanted = [x for x in url_dict["datasets"] if "bands-wide" in x]
    assert len(url_wanted) == 1
    return url_wanted[0]

# Now let's try it out
url = select_band_csv("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/json/api.json")
url
'https://raw.githubusercontent.com/mikeAdamss/datachef/main/tests/fixtures/csv/bands-wide.csv'
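One caveat worth keeping in mind: `assert` statements are skipped when Python runs with the `-O` flag, so if the "exactly one url" requirement must always be enforced, an explicit exception may be safer. The sketch below shows that variant without any network access. The helper name `pick_single` and the `example.com` urls are made up for illustration; only the `"datasets"` key is taken from the example above.

```python
from typing import Dict, List


def pick_single(urls: List[str], needle: str) -> str:
    """Return the only url containing `needle`; raise if there is not exactly one."""
    matches = [u for u in urls if needle in u]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one url containing {needle!r}, found {len(matches)}"
        )
    return matches[0]


# Made-up payload in the same shape the example reads: a "datasets" list of urls.
payload: Dict[str, List[str]] = {
    "datasets": [
        "https://example.com/csv/bands-wide.csv",
        "https://example.com/csv/something-else.csv",
    ]
}

print(pick_single(payload["datasets"], "bands-wide"))
```

A pre hook could then call `pick_single(r.json()["datasets"], "bands-wide")` in place of the list comprehension and assert.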
And now let’s see how we use this pre hook as a keyword argument to the acquire module.
from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http(
    "https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/json/api.json",
    pre_hook=select_band_csv)
preview(table)
Unnamed Table
A | B | C | D | E | F | G | H | I | J | K | |
1 | |||||||||||
2 | Houses | Cars | Boats | Houses | Cars | Boats | |||||
3 | Beatles | Rolling Stones | |||||||||
4 | John | 1 | 5 | 9 | Keith | 2 | 6 | 10 | |||
5 | Paul | 2 | 6 | 10 | Mick | 3 | 7 | 11 | |||
6 | George | 2 | 7 | 11 | Charlie | 3 | 8 | 12 | |||
7 | Ringo | 4 | 8 | 12 | Ronnie | 5 | 9 | 13 | |||
8 |
And voila! We’re dynamically selecting the csv we want from the options presented by a json api.
post_hook=#
The post hook follows a similar concept but is executed after the data is acquired, so it operates on whatever acquire would otherwise return.
For example:
For acquire.csv.local() it operates against a single CsvSelectable object.
For acquire.xlsx.local() it operates against a list of XlsxSelectable objects.
In all cases whatever your post_hook returns is what acquire will return.
As with the pre hook the implementation is nice and simple (see these two lines).
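To make the "whatever your post_hook returns is what acquire will return" rule concrete, here is a small sketch. `acquire_many_sketch` and `first_only` are hypothetical names, and plain strings stand in for selectables; this is not tidychef's own code, just the pattern as described above.

```python
from typing import Any, Callable, List, Optional


def acquire_many_sketch(
    names: List[str],
    post_hook: Optional[Callable[[List[str]], Any]] = None,
) -> Any:
    """Hypothetical stand-in for an acquire method that yields several tables."""
    tables = list(names)  # pretend these are freshly loaded selectables
    if post_hook is not None:
        # Whatever the hook returns replaces the normal return value.
        return post_hook(tables)
    return tables


def first_only(tables: List[str]) -> str:
    """Toy post hook: turn a list of tables into a single table."""
    return tables[0]


print(acquire_many_sketch(["Table 1", "Table 2"]))                        # ['Table 1', 'Table 2']
print(acquire_many_sketch(["Table 1", "Table 2"], post_hook=first_only))  # Table 1
```

Note that the hook changed the return type from a list to a single item, so any type hints on the calling side need to reflect what the hook returns.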
For this example we’ll be a little more ambitious and use a callable class.
NamedSheets takes a list of table/tab/sheet names that we want and will filter out those we do not want as part of the acquisition function.
from dataclasses import dataclass
from typing import List

from tidychef import acquire
from tidychef.selection import XlsxSelectable

@dataclass
class NamedSheets:
    sheet_names: List[str]

    def __call__(self, sheets: List[XlsxSelectable]):
        # Keep only the sheets whose name was asked for
        return [x for x in sheets if x.name in self.sheet_names]

tables: List[XlsxSelectable] = acquire.xlsx.http(
    "https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx",
    post_hook=NamedSheets(["Table 3a", "Table 3b"]))

# To show it worked, we'll just iterate and print the table names
for table in tables:
    print(table.name)
Table 3a
Table 3b
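One thing to be aware of with a filter like this is that a misspelt or missing sheet name is dropped silently. If that matters, a stricter variant might raise instead. The sketch below is an assumption-laden illustration: `StrictNamedSheets` and `FakeSheet` are hypothetical names, and `FakeSheet` models only the `name` attribute so the example runs without downloading anything.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeSheet:
    """Stand-in for a selectable; only the `name` attribute is modelled."""
    name: str


@dataclass
class StrictNamedSheets:
    """Hypothetical variant of NamedSheets that refuses to skip missing sheets."""
    sheet_names: List[str]

    def __call__(self, sheets: List[FakeSheet]) -> List[FakeSheet]:
        found = {s.name for s in sheets}
        missing = [n for n in self.sheet_names if n not in found]
        if missing:
            raise ValueError(f"Sheets not found: {missing}")
        return [s for s in sheets if s.name in self.sheet_names]


sheets = [FakeSheet("Contents"), FakeSheet("Table 3a"), FakeSheet("Table 3b")]
kept = StrictNamedSheets(["Table 3a", "Table 3b"])(sheets)
print([s.name for s in kept])  # ['Table 3a', 'Table 3b']
```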
A Brief Explanation Of Selectables#
There will be more on this in the next section, but a tidychef Selectable is the primary class we use for selecting cells.
Type | Description |
---|---|
 | The common class all selectables inherit from. This is where the selection methods that all tabulated sources have in common reside. |
 | Extends |
 | Extends |
 | Extends |
As an example, imagine a colourful xlsx file. Would it not be potentially useful to create a cell colour based selector? Possibly yes, but such a selector would use xlsx-only properties and as such would make no sense, and almost certainly not work, when called against data ingested from other data formats. The above pattern allows us to create such a method but only expose it to users using xlsx sources.
In other words, selectables enable context appropriate methods.
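The idea can be sketched with plain Python inheritance. Everything below is hypothetical and invented for illustration (`Cell`, `BaseSelectableSketch`, `XlsxSelectableSketch`, `non_blank`, `has_colour`); none of these are tidychef classes or methods. The sketch only shows how a format-specific method can live on a subclass while shared methods stay on the base class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    value: str
    colour: str = "none"  # only meaningful for formats that store styling


@dataclass
class BaseSelectableSketch:
    """Hypothetical base class: methods every tabulated source can support."""
    cells: List[Cell] = field(default_factory=list)

    def non_blank(self) -> List[Cell]:
        return [c for c in self.cells if c.value.strip() != ""]


@dataclass
class XlsxSelectableSketch(BaseSelectableSketch):
    """Hypothetical xlsx subclass: adds a method that relies on cell colour."""

    def has_colour(self, colour: str) -> List[Cell]:
        return [c for c in self.cells if c.colour == colour]


xlsx_table = XlsxSelectableSketch([Cell("John", "red"), Cell("", "red"), Cell("Paul")])
csv_table = BaseSelectableSketch([Cell("John"), Cell("Paul")])

print([c.value for c in xlsx_table.non_blank()])        # ['John', 'Paul']
print([c.value for c in xlsx_table.has_colour("red")])  # ['John', '']
print(hasattr(csv_table, "has_colour"))                 # False
```

The colour-based method simply does not exist on the base class, so it cannot be called by mistake against a source that has no notion of colour.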
Using the acquire(selectable=) keyword.#
The acquire function allows you to override the type of selectable class the data populates via the selectable= keyword.
As an (entirely contrived and rather pointless) example, let’s override the Selectable for acquire.xlsx.http() such that it returns us an XlsSelectable rather than the XlsxSelectable it usually does.
from typing import List

from tidychef import acquire
from tidychef.selection import XlsSelectable

tables: List[XlsSelectable] = acquire.xlsx.http(
    "https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx",
    selectable=XlsSelectable)
print(type(tables[0]))
<class 'tidychef.selection.xls.xls.XlsSelectable'>
…but… why would you want to do this?
Because Selectables are designed to be easily extended with powerful custom behaviours and methods, and you don’t want to have to create a new acquire mechanism every time you wish to do so.
Don’t worry, this technique will be detailed with examples in the next section. For our purposes here you just need to understand how you override the Selectable class during data acquisition.
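Until then, a toy model may help show why a keyword like this is convenient. The names below (`TableSketch`, `ShoutingTableSketch`, `acquire_rows_sketch`) are all hypothetical and this is not how tidychef is implemented; the sketch simply assumes that the loading logic stays the same and only the class being populated changes.

```python
from typing import List, Type


class TableSketch:
    """Hypothetical default selectable class."""

    def __init__(self, rows: List[List[str]]):
        self.rows = rows


class ShoutingTableSketch(TableSketch):
    """Hypothetical extension that adds one custom method."""

    def upper_rows(self) -> List[List[str]]:
        return [[value.upper() for value in row] for row in self.rows]


def acquire_rows_sketch(
    rows: List[List[str]],
    selectable: Type[TableSketch] = TableSketch,
) -> TableSketch:
    """Toy acquire: the same loading logic populates whichever class is requested."""
    return selectable(rows)


default_table = acquire_rows_sketch([["John", "Paul"]])
custom_table = acquire_rows_sketch([["John", "Paul"]], selectable=ShoutingTableSketch)

print(type(default_table).__name__)  # TableSketch
print(type(custom_table).__name__)   # ShoutingTableSketch
print(custom_table.upper_rows())     # [['JOHN', 'PAUL']]
```

The acquisition function is written once, and new behaviour arrives by passing a different class rather than by writing a new loader.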