Finding and handling duplicate data
Why duplicate IDs and rows are a problem and how to detect and handle them.
Why duplicates are a problem
When importing CSV into a database, duplicate values in a primary or unique key column cause constraint errors that stop the import. When merging two CSVs by key, duplicate keys make it ambiguous which row is the “correct” one — the result is either an error or silent data corruption where the wrong value overwrites a good one. Finding and resolving duplicates before import or merge is one of the most reliable ways to prevent these failures.
How duplicates get into CSV files
Duplicates are rarely introduced intentionally. Common sources:
- Manual data entry — the same record entered twice by different people, or at different times without checking for existing entries
- System migration — data from two systems merged without deduplication produces duplicate IDs from each source
- File concatenation — appending two exports of overlapping time periods or datasets creates rows that appear in both
- Whitespace and invisible characters — “U001” and “U001 “ are different strings, so the database may accept both, creating apparent duplicates that look identical on screen
- Case sensitivity — “Alice@example.com” and “alice@example.com” may be treated as distinct by the importer even though they represent the same user
Types of duplicates
- Exact duplicates
- Every field in the row is identical. The entire row was entered or imported twice. Safe to remove all but one copy.
- Same key, different content
- The key column (ID, email, code) is the same but other fields differ. This could be a versioning issue, a data update that wasn’t applied cleanly, or conflicting entries from two sources. You need to decide which version is correct.
- Near-duplicates (whitespace or invisible character difference)
- The key values look identical but differ by a trailing space, leading space, or invisible character. These are almost always errors and should be cleaned using the single-file check trim and invisible character removal before duplicate detection.
- Legitimate duplicates (non-unique key tables)
- Some tables genuinely allow the same key to appear multiple times — order line items, transaction logs, or event histories. These are not errors and should not be removed. Make sure you are checking the right column.
Choosing the right key column
The key column is the one that should uniquely identify each row in your use case. Common choices:
- ID columns — user_id, product_id, order_id: these are explicitly designed to be unique
- Email addresses — in user data, email is often used as a natural unique key
- Product or SKU codes — should be unique in a product catalog
- Composite keys — sometimes uniqueness is defined by a combination of columns (e.g. date + store_id). For composite keys, you may need to check each column pair separately or merge them into a single column before checking.
If you are unsure which column to use as the key, think about what column your database or system uses as the primary key for that table — that is the one to check.
Step-by-step: detecting duplicates
- Clean first — run Single-file check and apply trim and invisible character removal before checking for duplicates. Near-duplicates caused by whitespace will be resolved in this step, making the duplicate report more accurate.
- Select the key column — in the single-file check, choose the column that should be unique.
- Review the duplicate report — the tool lists which values appear more than once and shows the row numbers of each occurrence. Review each group to understand why the duplicate exists.
- Download and edit — download the CSV and manually remove or merge the duplicate rows according to your business rules. The tool does not automatically delete rows because the correct resolution depends on context.
- Re-check — run single-file check again on the edited file to confirm no duplicates remain.
How to decide which row to keep
- Exact duplicates — keep any one copy; delete the rest.
- Same key, different data — check which row has the more recent or more complete data. If there is a timestamp column, keep the latest. If the data is from two different sources, determine which source is authoritative for this type of record.
- Cannot decide — flag the row for manual review rather than making an automated decision. Import errors are preferable to silently importing the wrong data.
Duplicates in two-file compare
When comparing an old and new version of a CSV using two-file compare, duplicate keys in either file cause the diff to be misleading. The tool aligns rows by key column value; when the same key appears multiple times, alignment breaks and changes are reported incorrectly — rows that were added or removed may show as “changed,” and genuinely changed rows may not be detected. Always remove duplicates from both files before comparing.
Prevention
- Add a duplicate check to your data pipeline as a pre-import validation step, not just a one-time cleanup.
- When exporting from a source system, include a
DISTINCTclause or equivalent to avoid accidental duplicates at the export stage. - Document which column is the unique key for each CSV type your team works with — this prevents repeated “which column do I check?” questions.
Open the tools
- Single-file check — detect and list duplicate IDs
- Format & basic check — column count and structure before import
- Compare two files — diff between versions after deduplication
- CSV cleaning guide — full cleaning workflow including invisible characters and trimming