Finding and handling duplicate data

Why duplicate IDs and rows are a problem and how to detect and handle them.

Why duplicates are a problem

When importing CSV into a database, duplicate values in a primary or unique key column cause constraint errors that stop the import. When merging two CSVs by key, duplicate keys make it ambiguous which row is the “correct” one — the result is either an error or silent data corruption where the wrong value overwrites a good one. Finding and resolving duplicates before import or merge is one of the most reliable ways to prevent these failures.
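As a rough illustration of the import failure described above, the sketch below loads rows into an in-memory SQLite table whose `id` column is a primary key. The table name `customers`, its columns, and the helper `import_rows` are hypothetical, chosen only for this example. Other databases report the violation with different wording.

```python
import sqlite3


def import_rows(rows):
    """Insert (id, email) rows into a table with a primary key on id.

    Hypothetical helper: returns a short status string instead of raising.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, email TEXT)")
    try:
        # Using the connection as a context manager wraps the inserts in one
        # transaction, which is rolled back if any insert fails.
        with conn:
            conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    except sqlite3.IntegrityError as exc:
        return f"import failed: {exc}"
    finally:
        conn.close()
    return "import ok"


print(import_rows([("1", "a@example.com"), ("2", "b@example.com")]))
print(import_rows([("1", "a@example.com"), ("1", "b@example.com")]))
```

The second call fails because the key `1` appears twice, and the whole batch is rolled back rather than partially imported.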

How duplicates get into CSV files

Duplicates are rarely introduced intentionally. Common sources:

Types of duplicates

Exact duplicates
Every field in the row is identical. The entire row was entered or imported twice. Safe to remove all but one copy.
Same key, different content
The key column (ID, email, code) is the same but other fields differ. This could be a versioning issue, a data update that wasn’t applied cleanly, or conflicting entries from two sources. You need to decide which version is correct.
Near-duplicates (whitespace or invisible character difference)
The key values look identical but differ by a trailing space, a leading space, or an invisible character. These are almost always errors. Clean them with the trim and invisible character removal options in the single-file check before running duplicate detection.
Legitimate duplicates (non-unique key tables)
Some tables genuinely allow the same key to appear multiple times — order line items, transaction logs, or event histories. These are not errors and should not be removed. Make sure you are checking the right column.
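The first three categories can be told apart mechanically. The following is a minimal sketch, assuming the CSV has a header row and that "invisible characters" means the zero-width space and byte-order mark (a real file may contain others). The function names `normalize` and `classify_duplicates` are made up for this example. Code cannot tell whether a repeated key is legitimate, so that judgment stays with you.

```python
import csv
import io
from collections import defaultdict

# Assumption: only these two invisible characters are handled here.
INVISIBLE = ("\u200b", "\ufeff")  # zero-width space, byte-order mark


def normalize(value):
    """Remove the invisible characters above, then trim whitespace."""
    for ch in INVISIBLE:
        value = value.replace(ch, "")
    return value.strip()


def classify_duplicates(csv_text, key):
    """Map each repeated (normalized) key value to a duplicate category."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[normalize(row[key])].append(row)

    result = {}
    for value, rows in groups.items():
        if len(rows) < 2:
            continue  # key appears once: not a duplicate
        raw_keys = {row[key] for row in rows}
        if len(raw_keys) > 1:
            # Keys match only after cleaning.
            result[value] = "near-duplicate key"
        elif all(row == rows[0] for row in rows):
            result[value] = "exact"
        else:
            result[value] = "same key, different content"
    return result


sample = "id,name\n1,Ann\n1,Ann\n2,Bob\n2,Robert\n3,Cy\n3 ,Cy\n4,Di\n"
print(classify_duplicates(sample, "id"))
```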

Choosing the right key column

The key column is the one that should uniquely identify each row in your use case. Common choices:

If you are unsure which column to use as the key, think about what column your database or system uses as the primary key for that table — that is the one to check.
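If the primary key is not documented, one way to narrow the candidates is to count repeated values per column. This is only a sketch: the function name `repeated_values_by_column` is hypothetical, and a column that happens to be unique in one file is not necessarily the key your system uses.

```python
import csv
import io
from collections import Counter


def repeated_values_by_column(csv_text):
    """For each column, count how many distinct values occur more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    report = {}
    for column in reader.fieldnames or []:
        counts = Counter(row[column] for row in rows)
        report[column] = sum(1 for n in counts.values() if n > 1)
    return report


sample = (
    "id,email,country\n"
    "1,a@example.com,US\n"
    "2,b@example.com,US\n"
    "3,a@example.com,DE\n"
)
print(repeated_values_by_column(sample))
```

A column with a count of 0 is a candidate key. Columns with repeats either are not keys or contain duplicates that need review.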

Step-by-step: detecting duplicates

  1. Clean first — run Single-file check and apply trim and invisible character removal before checking for duplicates. Near-duplicates caused by whitespace will be resolved in this step, making the duplicate report more accurate.
  2. Select the key column — in the single-file check, choose the column that should be unique.
  3. Review the duplicate report — the tool lists which values appear more than once and shows the row numbers of each occurrence. Review each group to understand why the duplicate exists.
  4. Download and edit — download the CSV and manually remove or merge the duplicate rows according to your business rules. The tool does not automatically delete rows because the correct resolution depends on context.
  5. Re-check — run single-file check again on the edited file to confirm no duplicates remain.
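For readers who want to reproduce the report outside the tool, the steps above can be approximated in a few lines. This sketch is not the tool's implementation. It assumes a header row, no line breaks inside quoted fields (so row numbers match file lines), and trims only ordinary whitespace. The name `duplicate_report` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def duplicate_report(csv_text, key):
    """Return {key value: [row numbers]} for values that appear more than once.

    Row numbers count the header as row 1, so the first data row is row 2.
    """
    occurrences = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):
        cleaned = row[key].strip()  # step 1: trim before comparing
        occurrences[cleaned].append(row_number)
    # Step 3: keep only the values seen on more than one row.
    return {value: nums for value, nums in occurrences.items() if len(nums) > 1}


before = "email,name\na@example.com,Ann\nb@example.com,Bob\na@example.com ,Ann B\n"
print(duplicate_report(before, "email"))

# Step 5: after editing the file, an empty report confirms no duplicates remain.
after = "email,name\na@example.com,Ann B\nb@example.com,Bob\n"
print(duplicate_report(after, "email"))
```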

How to decide which row to keep

Duplicates in two-file compare

When comparing an old and new version of a CSV using two-file compare, duplicate keys in either file cause the diff to be misleading. The tool aligns rows by key column value; when the same key appears multiple times, alignment breaks and changes are reported incorrectly — rows that were added or removed may show as “changed,” and genuinely changed rows may not be detected. Always remove duplicates from both files before comparing.
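To see why duplicate keys break alignment, consider a simple key-based diff. The sketch below is an assumption about how such a comparison is commonly built, not a description of the tool's internals. The names `index_by_key` and `diff_by_key` are hypothetical. It refuses to index a file with repeated keys, because a plain dictionary would silently keep only the last row for each key.

```python
def index_by_key(rows, key):
    """Build {key value: row}, raising if any key value repeats."""
    index = {}
    duplicates = set()
    for row in rows:
        if row[key] in index:
            duplicates.add(row[key])
        index[row[key]] = row  # a later row would overwrite an earlier one
    if duplicates:
        raise ValueError(f"duplicate keys: {sorted(duplicates)}")
    return index


def diff_by_key(old_rows, new_rows, key):
    """Align two row lists by key and report added, removed, and changed keys."""
    old = index_by_key(old_rows, key)
    new = index_by_key(new_rows, key)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


old_rows = [{"id": "1", "qty": "5"}, {"id": "2", "qty": "7"}]
new_rows = [{"id": "1", "qty": "6"}, {"id": "3", "qty": "1"}]
print(diff_by_key(old_rows, new_rows, "id"))
```

Without the duplicate check, two old rows sharing the key `1` would collapse into one entry, and the diff would compare the new row against whichever copy happened to come last.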

Prevention
