Acquire

Acquire#

The acquire module is how data is loaded into tidychef.

The acquire module uses dot notation to specify format and origin (local vs http etc).

Simple examples for common tabular data formats follow.

Note - We’re using excel style cell references for previews throughout these examples. This functionality will be explained in the Preview page of this documentation.

This is a fairly exhaustive list to drive the pattern home, if you’re here for the fits time I’d skim to your relevant format and come back later as needed.

CSV Data From Local#

Creating a single selectable table-like object from a local csv file.

You can download a copy of the example data being used here.

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

# Argument is the location of any csv file on your machine
# This can be a string or a python Path object.
table: CsvSelectable = acquire.csv.local("../../tests/fixtures/csv/bands-wide.csv")
preview(table)

Unnamed Table

	A	B	C	D	E	F	G	H	I	J	K
1
2			Houses	Cars	Boats				Houses	Cars	Boats
3	Beatles						Rolling Stones
4		John	1	5	9			Keith	2	6	10
5		Paul	2	6	10			Mick	3	7	11
6		George	2	7	11			Charlie	3	8	12
7		Ringo	4	8	12			Ronnie	5	9	13
8

CSV Data from Http(s)#

You can also load a csv via http as per the following example:

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

# Argument is any csv file accessible via http or https
table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")
preview(table)

Unnamed Table

	A	B	C	D	E	F	G	H	I	J	K
1
2			Houses	Cars	Boats				Houses	Cars	Boats
3	Beatles						Rolling Stones
4		John	1	5	9			Keith	2	6	10
5		Paul	2	6	10			Mick	3	7	11
6		George	2	7	11			Charlie	3	8	12
7		Ringo	4	8	12			Ronnie	5	9	13
8

Customising Csv Loads#

Both the local and http csv loaders are wrappers around the csv reader from the standard python library csv package and propogate keyword arguments.

This means you can pass any keyword arguments through to acquire.csv.local() and acquire.csv.http() that you could pass to the csv.reader() method.

As an example here we are loading a csv file using a | delimiter in place of commas. This is the example data we’re using.

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

# Lets specify a different delimiter
table: CsvSelectable = acquire.csv.local("../../tests/fixtures/csv/pipe-delimited.csv", delimiter="|")
preview(table)

Unnamed Table

	A	B	C
1	Age	Male	Female
2	1	3	9
3	14	54	12
4	9	0	3

Note#

📌 All other formats follow the same acquire pattern. Feel free to skip ahead to the type relevant to your workflow.

Loading Xlsx Data#

Creating a list of selectable table-like objects from a local xlsx file.

You can download the example xlsx data being used here.

Note - Some of the following examples are quite big tables so we’re using the `preview()` keyword `bounded=` to limit the size of the previews. Again this functionality will be explained in detail in the Preview page of this documentation.

Xlsx Data From Local#

from typing import List
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

# Note: tables (note plural) as its now a list of tabulated data sources
tables: List[XlsxSelectable] = acquire.xlsx.local("../../tests/fixtures/xlsx/ons-oic.xlsx")
preview(tables[0], bounded="A1:B4")

Cover Sheet

	A	B
1	Output in the construction industry reference tables April 2023
2	This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations
3	Coverage:	Great Britain
4	Released:	14 June 2023

Rather than using the rather clunky tables[0] syntax, we can also pass a tables= keyword as per the below.

from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

# Note: tables (note plural) as its now a list of tabulated data sources
table: XlsxSelectable = acquire.xlsx.local("../../tests/fixtures/xlsx/ons-oic.xlsx", tables="Cover Sheet")
preview(table, bounded="A1:B4")

Cover Sheet

	A	B
1	Output in the construction industry reference tables April 2023
2	This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations
3	Coverage:	Great Britain
4	Released:	14 June 2023

Xlsx Data From Http(s)#

from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

# Note: tables (note plural) as its now a list of tabulated data sources
table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Cover Sheet")
preview(table, bounded="A1:B4")

Cover Sheet

	A	B
1	Output in the construction industry reference tables April 2023
2	This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations
3	Coverage:	Great Britain
4	Released:	14 June 2023

Loading Xls Data#

Xls Data From Local#

You can download the example data being used here.

from tidychef import acquire, preview
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.local("../../tests/fixtures/xls/sample.xls", tables="SalesOrders")
preview(table, bounded="A1:G7")

SalesOrders

	A	B	C	D	E	F	G
1	OrderDate	Region	Rep	Item	Units	Unit Cost	Total
2	01/06/21	East	Jones	Pencil	95.0	1.99	189.05
3	01/23/21	Central	Kivell	Binder	50.0	19.99	999.4999999999999
4	02/09/21	Central	Jardine	Pencil	36.0	4.99	179.64
5	02/26/21	Central	Gill	Pen	27.0	19.99	539.7299999999999
6	03/15/21	West	Sorvino	Pencil	56.0	2.99	167.44
7	04/01/21	East	Jones	Binder	60.0	4.99	299.40000000000003

Xls Data From Http(s)#

from tidychef import acquire, preview
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xls/sample.xls", tables="SalesOrders")
preview(table, bounded="A1:G7")

SalesOrders

	A	B	C	D	E	F	G
1	OrderDate	Region	Rep	Item	Units	Unit Cost	Total
2	01/06/21	East	Jones	Pencil	95.0	1.99	189.05
3	01/23/21	Central	Kivell	Binder	50.0	19.99	999.4999999999999
4	02/09/21	Central	Jardine	Pencil	36.0	4.99	179.64
5	02/26/21	Central	Gill	Pen	27.0	19.99	539.7299999999999
6	03/15/21	West	Sorvino	Pencil	56.0	2.99	167.44
7	04/01/21	East	Jones	Binder	60.0	4.99	299.40000000000003

Loading Ods Data#

Ods Data From Local#

from tidychef import acquire, preview
from tidychef.selection import OdsSelectable

table: OdsSelectable = acquire.ods.local("../../tests/fixtures/ods/EB1-_Existing_Domestic_Properties.ods", tables="Table_of_contents")
preview(table)

Table_of_contents

	A	B	C	D	E
1	Table of contents
2	Worksheet number	Worksheet title	Date this data was first published	Next publication date
3	EB1 in England and Wales	EB1- Existing Domestic Properties in England and Wales by Energy Efficiency Rating	Thursday 27 July 2023	Thursday 26 October 2023
4	EB1 in England	EB1- Existing Domestic Properties in England by Energy Efficiency Rating	Thursday 27 July 2023	Thursday 26 October 2023
5	EB1 in Wales	EB1- Existing Domestic Properties in Wales by Energy Efficiency Rating	Thursday 27 July 2023	Thursday 26 October 2023
6	EB1 by Region	EB1- Existing Domestic Properties by Region by Energy Efficiency Rating	Thursday 27 July 2023	Thursday 26 October 2023
7	EB1 by Local Authority	EB1- Existing Domestic Properties by Local Authority by Energy Efficiency Rating	Thursday 27 July 2023	Thursday 26 October 2023
8
9
10
11
12

Ods Data From Http(s)#

from tidychef import acquire, preview
from tidychef.selection import OdsSelectable

table: OdsSelectable = acquire.ods.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/ods/EB1-_Existing_Domestic_Properties.ods", tables="Table_of_contents")
preview(table)

Table_of_contents

	A	B	C	D	E
1	Table of contents
2	Worksheet number	Worksheet title	Date this data was first published	Next publication date
3	EB1 in England and Wales	EB1- Existing Domestic Properties in England and Wales by Energy Efficiency Rating	Thursday 27 July 2023	Thursday 26 October 2023
4	EB1 in England	EB1- Existing Domestic Properties in England by Energy Efficiency Rating	Thursday 27 July 2023	Thursday 26 October 2023
5	EB1 in Wales	EB1- Existing Domestic Properties in Wales by Energy Efficiency Rating	Thursday 27 July 2023	Thursday 26 October 2023
6	EB1 by Region	EB1- Existing Domestic Properties by Region by Energy Efficiency Rating	Thursday 27 July 2023	Thursday 26 October 2023
7	EB1 by Local Authority	EB1- Existing Domestic Properties by Local Authority by Energy Efficiency Rating	Thursday 27 July 2023	Thursday 26 October 2023
8
9
10
11
12

Using tables=#

Some of you will have wondered why the keyword for filtering to the table you want is “tables” (plural) not “table” singular.

It’s because the string you pass to tables is a regular expression.

We’re not going to go into regular expressions as part of this documentation, but for our purposes just be aware its pattern matching syntax and well worth exploring for anyone working in an ETL role.

As a simple example, you can use regular expressions to create an or statement with the pipe (|) character.

The following example shows how you can do just that to select two table from our xlsx source.

from typing import List
from tidychef import acquire
from tidychef.selection import XlsxSelectable

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx",
                                                 tables="Table 1a|Table 1b")
for table in tables:
    print(table.name)

Table 1a
Table 1b

For many users this ability to only select the tables you want to process will be all that’s required, but the real power here is the ability (where needed) to deal with inconsistencies in table naming.

This is crude (you can do some very clever things with regular expressions should you chose to explore them) but continuing on from our or example lets imagine the following:

tables="Table 1a|table 1a|table 1A|Table 1A|tTable1a|table1a|table1A|Table1A"

Which gives your acquire statement a fair degree of additional robustness, as minor changes or mistakes in table naming by the data publisher are accounted for.

A Note On Http(s) Caching#

All .http() methods described in this section use http caching via the python CacheControl package.

This should cache responses and fetch new data only when the last modified date of the data in question has changed (i.e when the data source has been updated).

Note - I say should here not due to any lack of faith in datachs coding, but because these things are dependant on the site in question correctly following web standards, which is a factor outside of our control.

Where desired, you can toggle this default caching behaviour off as needed by passing cache=False into the .http() function(s).

Example follows:

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv",
                         cache=False)
preview(table)

Unnamed Table

	A	B	C	D	E	F	G	H	I	J	K
1
2			Houses	Cars	Boats				Houses	Cars	Boats
3	Beatles						Rolling Stones
4		John	1	5	9			Keith	2	6	10
5		Paul	2	6	10			Mick	3	7	11
6		George	2	7	11			Charlie	3	8	12
7		Ringo	4	8	12			Ronnie	5	9	13
8

For those familiar with the requests python package you can also pass in a requests session object via the session= keyword to control this behaviour however you see fit.

Acquire

Contents

Acquire#

CSV Data From Local#

Unnamed Table

CSV Data from Http(s)#

Unnamed Table

Customising Csv Loads#

Unnamed Table

Note#

Loading Xlsx Data#

Xlsx Data From Local#

Cover Sheet

Cover Sheet

Xlsx Data From Http(s)#

Cover Sheet

Loading Xls Data#

Xls Data From Local#

SalesOrders

Xls Data From Http(s)#

SalesOrders

Loading Ods Data#

Ods Data From Local#

Table_of_contents

Ods Data From Http(s)#

Table_of_contents

Using tables=#

A Note On Http(s) Caching#

Unnamed Table