CSV data cleaning

Basics of data cleansing for CSV and what you can do with this tool.

What is data cleaning?

Data cleaning (or data cleansing) means finding and fixing or removing bad, noisy, or inconsistent data so it is reliable for analysis, import, or system integration. For CSV files specifically, the most common issues are invisible characters hidden inside text values, unnecessary whitespace around values, encoding problems that garble characters, and duplicate records in key columns.

Cleaning is not optional when data quality matters. A single invisible character in an ID column can make two identical-looking values fail to match. A trailing space in an email address can create a duplicate account in your database. These problems are completely undetectable by eye and only surface when something breaks downstream — often in a production system.

Types of dirty data in CSV

Invisible characters

Invisible characters are Unicode code points that have no visible glyph but still occupy space in the string. Common examples:

These characters cause matching failures, search misses, and import errors that are extremely difficult to diagnose without a dedicated tool. The single-file check detects them and shows exactly which rows are affected.

Leading and trailing whitespace

Spaces or tabs at the beginning or end of a value are invisible in most spreadsheet and text editor interfaces. ”user@example.com” and ”user@example.com “ look identical on screen but are different strings. This causes:

Full-width / half-width inconsistency

In East Asian datasets, the same character may appear in full-width form (a, 1, !) or half-width form (a, 1, !). If your data mixes both — which is common when data is entered by different people — values that should match will silently fail to. Normalising to half-width for codes and IDs before import prevents these invisible mismatches.

Duplicate records

Duplicate rows or duplicate values in a key column (ID, email, product code) cause unique constraint violations on import and ambiguous results during merges. A related but trickier case: “duplicates” created by whitespace or invisible character differences — the values look the same but are not. See Duplicate data guide for a detailed treatment of detection and resolution.

Encoding problems

Garbled characters are not just a display problem — they break string comparisons, pattern matching, and any downstream processing that expects valid text. Always fix encoding before attempting any other cleaning step; cleaning garbled text is impossible and produces worse results. See Encoding issues.

Where dirty data comes from

Step-by-step cleaning workflow

  1. Fix encoding first — if text looks garbled, run Encoding fix to convert to UTF-8. Applying any other cleaning step to garbled text will produce incorrect results.
  2. Check format — run Format & basic check to confirm encoding, delimiter, column count, and whether there are empty lines or a duplicate header row. Fix structural issues before touching data.
  3. Detect data quality issues — run Single-file check. The report shows: invisible characters and which rows contain them, leading/trailing spaces, duplicate IDs, and full/half-width flags.
  4. Apply fixes — use “Fix all issues” to apply all detected fixes in one click, or handle each category separately. All fixes are applied in your browser; the file is never uploaded to a server.
  5. Download and verify — download the cleaned CSV and run the single-file check again to confirm zero issues before uploading to your system.

Before and after example

A user ID column where some values were copy-pasted from a web page:

user_id,name
U001,Alice
U002​,Bob
U003 ,Carol

Row 2 has a zero-width space after “U002”; row 3 has a trailing regular space after “U003”. Both are invisible on screen. After running single-file check and applying fixes:

user_id,name
U001,Alice
U002,Bob
U003,Carol

All three IDs are now clean strings that will match correctly in any database lookup or JOIN operation.

When not to clean

Not every detected “issue” is actually an error in your data. Review before applying bulk fixes:

Open the tools

Home · Use single-file check