TidyData: Duplicate Rows#
When you extract from messy data sources you’ll sometimes encounter scenarios where exactly the same observation has been included more than once (typically when joining tables).
This document explains how to deal with duplicates during your extractions.
Source Data#
The data source we’re using for these examples is shown below:
The full data source can be viewed here.
from tidychef import acquire, preview
from tidychef.selection import CsvSelectable
table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")
preview(table)
Unnamed Table
  | A       | B      | C      | D    | E     | F | G              | H       | I      | J    | K
1 |         |        |        |      |       |   |                |         |        |      |
2 |         |        | Houses | Cars | Boats |   |                |         | Houses | Cars | Boats
3 | Beatles |        |        |      |       |   | Rolling Stones |         |        |      |
4 |         | John   | 1      | 5    | 9     |   |                | Keith   | 2      | 6    | 10
5 |         | Paul   | 2      | 6    | 10    |   |                | Mick    | 3      | 7    | 11
6 |         | George | 2      | 7    | 11    |   |                | Charlie | 3      | 8    | 12
7 |         | Ringo  | 4      | 8    | 12    |   |                | Ronnie  | 5      | 9    | 13
8 |         |        |        |      |       |   |                |         |        |      |
Creating Duplicates#
For this example we need duplicated rows. To simulate that, we're going to join two copies of exactly the same tidy data together, which means every row will appear twice.
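The effect of joining a table to itself can be illustrated in plain Python (this is just an illustration of the idea, not tidychef code; the rows shown are sampled from the table above):

```python
# Two example rows from the tidy output: observation, band, asset, member.
rows = [
    ("1", "Beatles", "Houses", "John"),
    ("2", "Rolling Stones", "Houses", "Keith"),
]

# Concatenating the table with itself: every row now appears exactly twice.
joined = rows + rows

assert len(joined) == 2 * len(rows)
assert joined.count(("1", "Beatles", "Houses", "John")) == 2
```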
First though, we need some tidy data (we’ll skip the preview here as you’ve seen this one a few times already).
from tidychef import acquire, filters
from tidychef.direction import right, below
from tidychef.selection import CsvSelectable
from tidychef.output import TidyData, Column
table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")
observations = table.filter(filters.is_numeric).label_as("Observation")
bands = (table.excel_ref("A3") | table.excel_ref("G3")).label_as("Band")
assets = table.excel_ref('2').is_not_blank().label_as("Asset")
members = (table.excel_ref("B") | table.excel_ref("H")).is_not_blank().label_as("Member")
tidy_data = TidyData(
    observations,
    Column(bands.attach_closest(right)),
    Column(assets.attach_directly(below)),
    Column(members.attach_directly(right))
)
Now let's join two of them together and drop the duplicates.
Dropping duplicates is done via the TidyData.drop_duplicates()
method.
If called without keyword arguments, duplicates are dropped with no user feedback. Alternatively, you can use one (or both) of the following keyword arguments:
print_duplicates
which prints out the rows you've just dropped.
csv_duplicate_path
which writes the same information to the specified csv file.
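Conceptually, this is an order-preserving deduplication that also keeps track of the rows it discards. A minimal plain-Python sketch of that idea (an illustration only, not tidychef's actual implementation):

```python
def drop_duplicates_sketch(rows):
    """Keep the first occurrence of each row; record any repeats."""
    seen = set()
    kept, dropped = [], []
    for row in rows:
        if row in seen:
            dropped.append(row)  # a repeat of a row we've already kept
        else:
            seen.add(row)
            kept.append(row)
    return kept, dropped

rows = [
    ("1", "Beatles", "Houses", "John"),
    ("2", "Rolling Stones", "Houses", "Keith"),
    ("1", "Beatles", "Houses", "John"),  # duplicate row
]
kept, dropped = drop_duplicates_sketch(rows)
assert len(kept) == 2 and len(dropped) == 1
```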
all_tidy_data = TidyData.from_tidy(tidy_data, tidy_data).drop_duplicates(print_duplicates=True, csv_duplicate_path="duplicates.csv")
Removed duplicate instances of the following row(s):
-----------------------------------------------------
1,Beatles,Houses,John
5,Beatles,Cars,John
9,Beatles,Boats,John
2,Rolling Stones,Houses,Keith
6,Rolling Stones,Cars,Keith
10,Rolling Stones,Boats,Keith
2,Beatles,Houses,Paul
6,Beatles,Cars,Paul
10,Beatles,Boats,Paul
3,Rolling Stones,Houses,Mick
7,Rolling Stones,Cars,Mick
11,Rolling Stones,Boats,Mick
2,Beatles,Houses,George
7,Beatles,Cars,George
11,Beatles,Boats,George
3,Rolling Stones,Houses,Charlie
8,Rolling Stones,Cars,Charlie
12,Rolling Stones,Boats,Charlie
4,Beatles,Houses,Ringo
8,Beatles,Cars,Ringo
12,Beatles,Boats,Ringo
5,Rolling Stones,Houses,Ronnie
9,Rolling Stones,Cars,Ronnie
13,Rolling Stones,Boats,Ronnie
The above is the result of print_duplicates=True.
Below, we'll check the contents of duplicates.csv
as well.
with open("duplicates.csv") as f:
    print(f.read())
1,Beatles,Houses,John
5,Beatles,Cars,John
9,Beatles,Boats,John
2,Rolling Stones,Houses,Keith
6,Rolling Stones,Cars,Keith
10,Rolling Stones,Boats,Keith
2,Beatles,Houses,Paul
6,Beatles,Cars,Paul
10,Beatles,Boats,Paul
3,Rolling Stones,Houses,Mick
7,Rolling Stones,Cars,Mick
11,Rolling Stones,Boats,Mick
2,Beatles,Houses,George
7,Beatles,Cars,George
11,Beatles,Boats,George
3,Rolling Stones,Houses,Charlie
8,Rolling Stones,Cars,Charlie
12,Rolling Stones,Boats,Charlie
4,Beatles,Houses,Ringo
8,Beatles,Cars,Ringo
12,Beatles,Boats,Ringo
5,Rolling Stones,Houses,Ronnie
9,Rolling Stones,Cars,Ronnie
13,Rolling Stones,Boats,Ronnie