Acquire#
The acquire module is how data is loaded into tidychef.
The acquire module uses dot notation to specify format and origin (local vs http etc).
Simple examples for common tabular data formats follow.
Note - We’re using excel style cell references for previews throughout these examples. This functionality will be explained in the Preview page of this documentation. |
---|
This is a fairly exhaustive list to drive the pattern home, if you’re here for the fits time I’d skim to your relevant format and come back later as needed.
CSV Data From Local#
Creating a single selectable table-like object from a local csv file.
You can download a copy of the example data being used here.
from tidychef import acquire, preview
from tidychef.selection import CsvSelectable
# Argument is the location of any csv file on your machine
# This can be a string or a python Path object.
table: CsvSelectable = acquire.csv.local("../../tests/fixtures/csv/bands-wide.csv")
preview(table)
Unnamed Table
A | B | C | D | E | F | G | H | I | J | K | |
1 | |||||||||||
2 | Houses | Cars | Boats | Houses | Cars | Boats | |||||
3 | Beatles | Rolling Stones | |||||||||
4 | John | 1 | 5 | 9 | Keith | 2 | 6 | 10 | |||
5 | Paul | 2 | 6 | 10 | Mick | 3 | 7 | 11 | |||
6 | George | 2 | 7 | 11 | Charlie | 3 | 8 | 12 | |||
7 | Ringo | 4 | 8 | 12 | Ronnie | 5 | 9 | 13 | |||
8 |
CSV Data from Http(s)#
You can also load a csv via http as per the following example:
from tidychef import acquire, preview
from tidychef.selection import CsvSelectable
# Argument is any csv file accessible via http or https
table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")
preview(table)
Unnamed Table
A | B | C | D | E | F | G | H | I | J | K | |
1 | |||||||||||
2 | Houses | Cars | Boats | Houses | Cars | Boats | |||||
3 | Beatles | Rolling Stones | |||||||||
4 | John | 1 | 5 | 9 | Keith | 2 | 6 | 10 | |||
5 | Paul | 2 | 6 | 10 | Mick | 3 | 7 | 11 | |||
6 | George | 2 | 7 | 11 | Charlie | 3 | 8 | 12 | |||
7 | Ringo | 4 | 8 | 12 | Ronnie | 5 | 9 | 13 | |||
8 |
Customising Csv Loads#
Both the local and http csv loaders are wrappers around the csv reader from the standard python library csv package and propogate keyword arguments.
This means you can pass any keyword arguments through to acquire.csv.local()
and acquire.csv.http()
that you could pass to the csv.reader()
method.
As an example here we are loading a csv file using a |
delimiter in place of commas. This is the example data we’re using.
from tidychef import acquire, preview
from tidychef.selection import CsvSelectable
# Lets specify a different delimiter
table: CsvSelectable = acquire.csv.local("../../tests/fixtures/csv/pipe-delimited.csv", delimiter="|")
preview(table)
Unnamed Table
A | B | C | |
1 | Age | Male | Female |
2 | 1 | 3 | 9 |
3 | 14 | 54 | 12 |
4 | 9 | 0 | 3 |
Note#
📌 All other formats follow the same acquire pattern. Feel free to skip ahead to the type relevant to your workflow.
Loading Xlsx Data#
Creating a list of selectable table-like objects from a local xlsx file.
You can download the example xlsx data being used here.
Note - Some of the following examples are quite big tables so we’re using the |
---|
Xlsx Data From Local#
from typing import List
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable
# Note: tables (note plural) as its now a list of tabulated data sources
tables: List[XlsxSelectable] = acquire.xlsx.local("../../tests/fixtures/xlsx/ons-oic.xlsx")
preview(tables[0], bounded="A1:B4")
Cover Sheet
A | B | |
1 | Output in the construction industry reference tables April 2023 | |
2 | This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations | |
3 | Coverage: | Great Britain |
4 | Released: | 14 June 2023 |
Rather than using the rather clunky tables[0]
syntax, we can also pass a tables=
keyword as per the below.
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable
# Note: tables (note plural) as its now a list of tabulated data sources
table: XlsxSelectable = acquire.xlsx.local("../../tests/fixtures/xlsx/ons-oic.xlsx", tables="Cover Sheet")
preview(table, bounded="A1:B4")
Cover Sheet
A | B | |
1 | Output in the construction industry reference tables April 2023 | |
2 | This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations | |
3 | Coverage: | Great Britain |
4 | Released: | 14 June 2023 |
Xlsx Data From Http(s)#
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable
# Note: tables (note plural) as its now a list of tabulated data sources
table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Cover Sheet")
preview(table, bounded="A1:B4")
Cover Sheet
A | B | |
1 | Output in the construction industry reference tables April 2023 | |
2 | This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations | |
3 | Coverage: | Great Britain |
4 | Released: | 14 June 2023 |
Loading Xls Data#
Xls Data From Local#
You can download the example data being used here.
from tidychef import acquire, preview
from tidychef.selection import XlsSelectable
table: XlsSelectable = acquire.xls.local("../../tests/fixtures/xls/sample.xls", tables="SalesOrders")
preview(table, bounded="A1:G7")
SalesOrders
A | B | C | D | E | F | G | |
1 | OrderDate | Region | Rep | Item | Units | Unit Cost | Total |
2 | 01/06/21 | East | Jones | Pencil | 95.0 | 1.99 | 189.05 |
3 | 01/23/21 | Central | Kivell | Binder | 50.0 | 19.99 | 999.4999999999999 |
4 | 02/09/21 | Central | Jardine | Pencil | 36.0 | 4.99 | 179.64 |
5 | 02/26/21 | Central | Gill | Pen | 27.0 | 19.99 | 539.7299999999999 |
6 | 03/15/21 | West | Sorvino | Pencil | 56.0 | 2.99 | 167.44 |
7 | 04/01/21 | East | Jones | Binder | 60.0 | 4.99 | 299.40000000000003 |
Xls Data From Http(s)#
from tidychef import acquire, preview
from tidychef.selection import XlsSelectable
table: XlsSelectable = acquire.xls.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xls/sample.xls", tables="SalesOrders")
preview(table, bounded="A1:G7")
SalesOrders
A | B | C | D | E | F | G | |
1 | OrderDate | Region | Rep | Item | Units | Unit Cost | Total |
2 | 01/06/21 | East | Jones | Pencil | 95.0 | 1.99 | 189.05 |
3 | 01/23/21 | Central | Kivell | Binder | 50.0 | 19.99 | 999.4999999999999 |
4 | 02/09/21 | Central | Jardine | Pencil | 36.0 | 4.99 | 179.64 |
5 | 02/26/21 | Central | Gill | Pen | 27.0 | 19.99 | 539.7299999999999 |
6 | 03/15/21 | West | Sorvino | Pencil | 56.0 | 2.99 | 167.44 |
7 | 04/01/21 | East | Jones | Binder | 60.0 | 4.99 | 299.40000000000003 |
Loading Ods Data#
Ods Data From Local#
from tidychef import acquire, preview
from tidychef.selection import OdsSelectable
table: OdsSelectable = acquire.ods.local("../../tests/fixtures/ods/EB1-_Existing_Domestic_Properties.ods", tables="Table_of_contents")
preview(table)
Table_of_contents
A | B | C | D | E | |
1 | Table of contents | ||||
2 | Worksheet number | Worksheet title | Date this data was first published | Next publication date | |
3 | EB1 in England and Wales | EB1- Existing Domestic Properties in England and Wales by Energy Efficiency Rating | Thursday 27 July 2023 | Thursday 26 October 2023 | |
4 | EB1 in England | EB1- Existing Domestic Properties in England by Energy Efficiency Rating | Thursday 27 July 2023 | Thursday 26 October 2023 | |
5 | EB1 in Wales | EB1- Existing Domestic Properties in Wales by Energy Efficiency Rating | Thursday 27 July 2023 | Thursday 26 October 2023 | |
6 | EB1 by Region | EB1- Existing Domestic Properties by Region by Energy Efficiency Rating | Thursday 27 July 2023 | Thursday 26 October 2023 | |
7 | EB1 by Local Authority | EB1- Existing Domestic Properties by Local Authority by Energy Efficiency Rating | Thursday 27 July 2023 | Thursday 26 October 2023 | |
8 | |||||
9 | |||||
10 | |||||
11 | |||||
12 |
Ods Data From Http(s)#
from tidychef import acquire, preview
from tidychef.selection import OdsSelectable
table: OdsSelectable = acquire.ods.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/ods/EB1-_Existing_Domestic_Properties.ods", tables="Table_of_contents")
preview(table)
Table_of_contents
A | B | C | D | E | |
1 | Table of contents | ||||
2 | Worksheet number | Worksheet title | Date this data was first published | Next publication date | |
3 | EB1 in England and Wales | EB1- Existing Domestic Properties in England and Wales by Energy Efficiency Rating | Thursday 27 July 2023 | Thursday 26 October 2023 | |
4 | EB1 in England | EB1- Existing Domestic Properties in England by Energy Efficiency Rating | Thursday 27 July 2023 | Thursday 26 October 2023 | |
5 | EB1 in Wales | EB1- Existing Domestic Properties in Wales by Energy Efficiency Rating | Thursday 27 July 2023 | Thursday 26 October 2023 | |
6 | EB1 by Region | EB1- Existing Domestic Properties by Region by Energy Efficiency Rating | Thursday 27 July 2023 | Thursday 26 October 2023 | |
7 | EB1 by Local Authority | EB1- Existing Domestic Properties by Local Authority by Energy Efficiency Rating | Thursday 27 July 2023 | Thursday 26 October 2023 | |
8 | |||||
9 | |||||
10 | |||||
11 | |||||
12 |
Using tables=#
Some of you will have wondered why the keyword for filtering to the table you want is “tables” (plural) not “table” singular.
It’s because the string you pass to tables is a regular expression.
We’re not going to go into regular expressions as part of this documentation, but for our purposes just be aware its pattern matching syntax and well worth exploring for anyone working in an ETL role.
As a simple example, you can use regular expressions to create an or statement with the pipe (|
) character.
The following example shows how you can do just that to select two table from our xlsx source.
from typing import List
from tidychef import acquire
from tidychef.selection import XlsxSelectable
tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx",
tables="Table 1a|Table 1b")
for table in tables:
print(table.name)
Table 1a
Table 1b
For many users this ability to only select the tables you want to process will be all that’s required, but the real power here is the ability (where needed) to deal with inconsistencies in table naming.
This is crude (you can do some very clever things with regular expressions should you chose to explore them) but continuing on from our or example lets imagine the following:
tables="Table 1a|table 1a|table 1A|Table 1A|tTable1a|table1a|table1A|Table1A"
Which gives your acquire statement a fair degree of additional robustness, as minor changes or mistakes in table naming by the data publisher are accounted for.
A Note On Http(s) Caching#
All .http()
methods described in this section use http caching via the python CacheControl package.
This should cache responses and fetch new data only when the last modified date of the data in question has changed (i.e when the data source has been updated).
Note - I say should here not due to any lack of faith in datachs coding, but because these things are dependant on the site in question correctly following web standards, which is a factor outside of our control. |
---|
Where desired, you can toggle this default caching behaviour off as needed by passing cache=False
into the .http()
function(s).
Example follows:
from tidychef import acquire, preview
from tidychef.selection import CsvSelectable
table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv",
cache=False)
preview(table)
Unnamed Table
A | B | C | D | E | F | G | H | I | J | K | |
1 | |||||||||||
2 | Houses | Cars | Boats | Houses | Cars | Boats | |||||
3 | Beatles | Rolling Stones | |||||||||
4 | John | 1 | 5 | 9 | Keith | 2 | 6 | 10 | |||
5 | Paul | 2 | 6 | 10 | Mick | 3 | 7 | 11 | |||
6 | George | 2 | 7 | 11 | Charlie | 3 | 8 | 12 | |||
7 | Ringo | 4 | 8 | 12 | Ronnie | 5 | 9 | 13 | |||
8 |
For those familiar with the requests python package you can also pass in a requests session object via the session=
keyword to control this behaviour however you see fit.