Acquire#

The acquire module is how data is loaded into tidychef.

The acquire module uses dot notation to specify format and origin (local vs http etc).

Simple examples for common tabular data formats follow.

Note - We’re using excel style cell references for previews throughout these examples. This functionality will be explained in the Preview page of this documentation.

This is a fairly exhaustive list to drive the pattern home, if you’re here for the fits time I’d skim to your relevant format and come back later as needed.

CSV Data From Local#

Creating a single selectable table-like object from a local csv file.

You can download a copy of the example data being used here.

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

# Argument is the location of any csv file on your machine
# This can be a string or a python Path object.
table: CsvSelectable = acquire.csv.local("../../tests/fixtures/csv/bands-wide.csv")
preview(table)

Unnamed Table

ABCDEFGHIJK
1
2HousesCarsBoatsHousesCarsBoats
3BeatlesRolling Stones
4John159Keith2610
5Paul2610Mick3711
6George2711Charlie3812
7Ringo4812Ronnie5913
8

CSV Data from Http(s)#

You can also load a csv via http as per the following example:

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

# Argument is any csv file accessible via http or https
table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")
preview(table)

Unnamed Table

ABCDEFGHIJK
1
2HousesCarsBoatsHousesCarsBoats
3BeatlesRolling Stones
4John159Keith2610
5Paul2610Mick3711
6George2711Charlie3812
7Ringo4812Ronnie5913
8

Customising Csv Loads#

Both the local and http csv loaders are wrappers around the csv reader from the standard python library csv package and propogate keyword arguments.

This means you can pass any keyword arguments through to acquire.csv.local() and acquire.csv.http() that you could pass to the csv.reader() method.

As an example here we are loading a csv file using a | delimiter in place of commas. This is the example data we’re using.

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

# Lets specify a different delimiter
table: CsvSelectable = acquire.csv.local("../../tests/fixtures/csv/pipe-delimited.csv", delimiter="|")
preview(table)

Unnamed Table

ABC
1AgeMaleFemale
2139
3145412
4903

Note#

📌 All other formats follow the same acquire pattern. Feel free to skip ahead to the type relevant to your workflow.

Loading Xlsx Data#

Creating a list of selectable table-like objects from a local xlsx file.

You can download the example xlsx data being used here.

Note - Some of the following examples are quite big tables so we’re using the preview() keyword bounded= to limit the size of the previews. Again this functionality will be explained in detail in the Preview page of this documentation.

Xlsx Data From Local#

from typing import List
from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

# Note: tables (note plural) as its now a list of tabulated data sources
tables: List[XlsxSelectable] = acquire.xlsx.local("../../tests/fixtures/xlsx/ons-oic.xlsx")
preview(tables[0], bounded="A1:B4")

Cover Sheet

AB
1Output in the construction industry reference tables April 2023
2This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations
3Coverage:Great Britain
4Released:14 June 2023

Rather than using the rather clunky tables[0] syntax, we can also pass a tables= keyword as per the below.

from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

# Note: tables (note plural) as its now a list of tabulated data sources
table: XlsxSelectable = acquire.xlsx.local("../../tests/fixtures/xlsx/ons-oic.xlsx", tables="Cover Sheet")
preview(table, bounded="A1:B4")

Cover Sheet

AB
1Output in the construction industry reference tables April 2023
2This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations
3Coverage:Great Britain
4Released:14 June 2023

Xlsx Data From Http(s)#

from tidychef import acquire, preview
from tidychef.selection import XlsxSelectable

# Note: tables (note plural) as its now a list of tabulated data sources
table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Cover Sheet")
preview(table, bounded="A1:B4")

Cover Sheet

AB
1Output in the construction industry reference tables April 2023
2This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations
3Coverage:Great Britain
4Released:14 June 2023

Loading Xls Data#

Xls Data From Local#

You can download the example data being used here.

from tidychef import acquire, preview
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.local("../../tests/fixtures/xls/sample.xls", tables="SalesOrders")
preview(table, bounded="A1:G7")

SalesOrders

ABCDEFG
1OrderDateRegionRepItemUnitsUnit CostTotal
201/06/21EastJonesPencil95.01.99189.05
301/23/21CentralKivellBinder50.019.99999.4999999999999
402/09/21CentralJardinePencil36.04.99179.64
502/26/21CentralGillPen27.019.99539.7299999999999
603/15/21WestSorvinoPencil56.02.99167.44
704/01/21EastJonesBinder60.04.99299.40000000000003

Xls Data From Http(s)#

from tidychef import acquire, preview
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xls/sample.xls", tables="SalesOrders")
preview(table, bounded="A1:G7")

SalesOrders

ABCDEFG
1OrderDateRegionRepItemUnitsUnit CostTotal
201/06/21EastJonesPencil95.01.99189.05
301/23/21CentralKivellBinder50.019.99999.4999999999999
402/09/21CentralJardinePencil36.04.99179.64
502/26/21CentralGillPen27.019.99539.7299999999999
603/15/21WestSorvinoPencil56.02.99167.44
704/01/21EastJonesBinder60.04.99299.40000000000003

Loading Ods Data#

Ods Data From Local#

from tidychef import acquire, preview
from tidychef.selection import OdsSelectable

table: OdsSelectable = acquire.ods.local("../../tests/fixtures/ods/EB1-_Existing_Domestic_Properties.ods", tables="Table_of_contents")
preview(table)

Table_of_contents

ABCDE
1Table of contents
2Worksheet numberWorksheet titleDate this data was first published Next publication date
3EB1 in England and WalesEB1- Existing Domestic Properties in England and Wales by Energy Efficiency RatingThursday 27 July 2023Thursday 26 October 2023
4EB1 in EnglandEB1- Existing Domestic Properties in England by Energy Efficiency RatingThursday 27 July 2023Thursday 26 October 2023
5EB1 in WalesEB1- Existing Domestic Properties in Wales by Energy Efficiency RatingThursday 27 July 2023Thursday 26 October 2023
6EB1 by RegionEB1- Existing Domestic Properties by Region by Energy Efficiency RatingThursday 27 July 2023Thursday 26 October 2023
7EB1 by Local AuthorityEB1- Existing Domestic Properties by Local Authority by Energy Efficiency RatingThursday 27 July 2023Thursday 26 October 2023
8
9
10
11
12

Ods Data From Http(s)#

from tidychef import acquire, preview
from tidychef.selection import OdsSelectable

table: OdsSelectable = acquire.ods.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/ods/EB1-_Existing_Domestic_Properties.ods", tables="Table_of_contents")
preview(table)

Table_of_contents

ABCDE
1Table of contents
2Worksheet numberWorksheet titleDate this data was first published Next publication date
3EB1 in England and WalesEB1- Existing Domestic Properties in England and Wales by Energy Efficiency RatingThursday 27 July 2023Thursday 26 October 2023
4EB1 in EnglandEB1- Existing Domestic Properties in England by Energy Efficiency RatingThursday 27 July 2023Thursday 26 October 2023
5EB1 in WalesEB1- Existing Domestic Properties in Wales by Energy Efficiency RatingThursday 27 July 2023Thursday 26 October 2023
6EB1 by RegionEB1- Existing Domestic Properties by Region by Energy Efficiency RatingThursday 27 July 2023Thursday 26 October 2023
7EB1 by Local AuthorityEB1- Existing Domestic Properties by Local Authority by Energy Efficiency RatingThursday 27 July 2023Thursday 26 October 2023
8
9
10
11
12

Using tables=#

Some of you will have wondered why the keyword for filtering to the table you want is “tables” (plural) not “table” singular.

It’s because the string you pass to tables is a regular expression.

We’re not going to go into regular expressions as part of this documentation, but for our purposes just be aware its pattern matching syntax and well worth exploring for anyone working in an ETL role.

As a simple example, you can use regular expressions to create an or statement with the pipe (|) character.

The following example shows how you can do just that to select two table from our xlsx source.

from typing import List
from tidychef import acquire
from tidychef.selection import XlsxSelectable

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/tidychef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx",
                                                 tables="Table 1a|Table 1b")
for table in tables:
    print(table.name)
Table 1a
Table 1b

For many users this ability to only select the tables you want to process will be all that’s required, but the real power here is the ability (where needed) to deal with inconsistencies in table naming.

This is crude (you can do some very clever things with regular expressions should you chose to explore them) but continuing on from our or example lets imagine the following:

tables="Table 1a|table 1a|table 1A|Table 1A|tTable1a|table1a|table1A|Table1A"

Which gives your acquire statement a fair degree of additional robustness, as minor changes or mistakes in table naming by the data publisher are accounted for.

A Note On Http(s) Caching#

All .http() methods described in this section use http caching via the python CacheControl package.

This should cache responses and fetch new data only when the last modified date of the data in question has changed (i.e when the data source has been updated).

Note - I say should here not due to any lack of faith in datachs coding, but because these things are dependant on the site in question correctly following web standards, which is a factor outside of our control.

Where desired, you can toggle this default caching behaviour off as needed by passing cache=False into the .http() function(s).

Example follows:

from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv",
                         cache=False)
preview(table)

Unnamed Table

ABCDEFGHIJK
1
2HousesCarsBoatsHousesCarsBoats
3BeatlesRolling Stones
4John159Keith2610
5Paul2610Mick3711
6George2711Charlie3812
7Ringo4812Ronnie5913
8

For those familiar with the requests python package you can also pass in a requests session object via the session= keyword to control this behaviour however you see fit.