# TidyData: Duplicate Rows

When you extract from messy data sources, you will sometimes find that exactly the same observation has been included more than once (typically as a result of joining tables).

This document explains how to deal with duplicates during your extractions.

## Source Data

The data source we're using for these examples is shown below:

The [full data source can be viewed here](https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv).

In [None]:
from tidychef import acquire, preview
from tidychef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")
preview(table)

## Creating Duplicates

For our example we're going to need duplicated rows. To simulate that we're just going to join two copies of exactly the same tidy data together. This means that **every row will be duplicated** in our example.
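To see why joining two identical copies doubles every row, here is a minimal sketch in plain Python. It does not use tidychef, and the row values are made up purely for illustration.

```python
from collections import Counter

# Hypothetical tidy rows: (Observation, Band, Asset, Member).
# These values are invented for illustration, not taken from the source data.
rows = [
    ("1", "Band A", "Asset X", "Member 1"),
    ("2", "Band A", "Asset Y", "Member 2"),
]

# Joining two copies of the same data appends every row a second time.
joined = rows + rows

# Count how often each distinct row occurs in the joined data.
counts = Counter(joined)
for row, count in counts.items():
    print(count, row)
```

Every distinct row is counted twice, which is the situation the rest of this document deals with.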

First though, we need some tidy data (we'll skip the preview here as you've seen this one a few times already).

In [None]:
from tidychef import acquire, filters
from tidychef.direction import right, below
from tidychef.selection import CsvSelectable
from tidychef.output import TidyData, Column

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/csv/bands-wide.csv")

observations = table.filter(filters.is_numeric).label_as("Observation")
bands = (table.excel_ref("A3") | table.excel_ref("G3")).label_as("Band")
assets = table.excel_ref('2').is_not_blank().label_as("Asset")
members = (table.excel_ref("B") | table.excel_ref("H")).is_not_blank().label_as("Member")

tidy_data = TidyData(
    observations,
    Column(bands.attach_closest(right)),
    Column(assets.attach_directly(below)),
    Column(members.attach_directly(right))
)

Now let's join two of them together and drop the duplicates.

Dropping duplicates is done via the `TidyData.drop_duplicates()` method.

If called without keyword arguments, duplicates are dropped with no user feedback. Alternatively, you can use one (or both) of the following keyword arguments:

- `print_duplicates`, which prints out the rows you've just dropped.
- `csv_duplicate_path`, which writes the same information to the specified csv file.
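Conceptually, the behaviour described above can be sketched in plain Python as follows. This is an illustration only and is not tidychef's implementation: `drop_duplicate_rows` is a hypothetical helper, and its parameter names are chosen simply to mirror the two options.

```python
import csv
from typing import List, Optional, Tuple

Row = Tuple[str, ...]


def drop_duplicate_rows(
    rows: List[Row],
    print_duplicates: bool = False,
    csv_duplicate_path: Optional[str] = None,
) -> List[Row]:
    """Hypothetical helper: keep the first occurrence of each row."""
    seen = set()
    kept: List[Row] = []
    dropped: List[Row] = []
    for row in rows:
        if row in seen:
            dropped.append(row)  # an exact repeat of an earlier row
        else:
            seen.add(row)
            kept.append(row)

    # Optional feedback: print the rows that were dropped.
    if print_duplicates:
        for row in dropped:
            print(row)

    # Optional feedback: write the dropped rows to a csv file.
    if csv_duplicate_path is not None:
        with open(csv_duplicate_path, "w", newline="") as f:
            csv.writer(f).writerows(dropped)

    return kept


# Made-up rows for illustration; the third row repeats the first.
rows = [("1", "a"), ("2", "b"), ("1", "a")]
unique_rows = drop_duplicate_rows(rows, print_duplicates=True)
```

With neither option set, the sketch drops repeats silently, matching the default described above.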

In [None]:
all_tidy_data = TidyData.from_tidy(tidy_data, tidy_data).drop_duplicates(
    print_duplicates=True,
    csv_duplicate_path="duplicates.csv",
)

The above is the result of `print_duplicates=True`.

Below, we'll check the contents of `duplicates.csv` as well.

In [None]:
with open("duplicates.csv") as f:
    print(f.read())