CSV data cleaning
Basics of data cleansing for CSV and what you can do with this tool.
What is data cleaning?
Data cleaning (or data cleansing) means finding and fixing or removing bad, noisy, or inconsistent data so it is reliable for analysis, import, or system integration. For CSV files specifically, the most common issues are invisible characters hidden inside text values, unnecessary whitespace around values, encoding problems that garble characters, and duplicate records in key columns.
Cleaning is not optional when data quality matters. A single invisible character in an ID column can make two identical-looking values fail to match. A trailing space in an email address can create a duplicate account in your database. These problems are completely undetectable by eye and only surface when something breaks downstream — often in a production system.
Types of dirty data in CSV
Invisible characters
Invisible characters are Unicode code points that have no visible glyph but still occupy space in the string. Common examples:
- Zero-width space (U+200B) — frequently appears when text is copied from web pages, PDFs, or chat applications
- Non-breaking space (U+00A0) — looks identical to a regular space but is treated as a different character in string comparisons
- Byte order mark (U+FEFF) — can appear mid-field when files are concatenated without stripping BOM from intermediate files
- Control characters (U+0000–U+001F) — sometimes present in exports from legacy mainframe or ERP systems
These characters cause matching failures, search misses, and import errors that are extremely difficult to diagnose without a dedicated tool. The single-file check detects them and shows exactly which rows are affected.
Leading and trailing whitespace
Spaces or tabs at the beginning or end of a value are invisible in most spreadsheet and text editor interfaces. ”user@example.com” and ”user@example.com “ look identical on screen but are different strings. This causes:
- Failed lookups and JOIN operations in databases — the query finds no match because the stored value has a trailing space
- Duplicate records that appear to have the same key but are treated as distinct by the database constraint
- Sorting and grouping errors in reports where leading spaces change sort order
- Broken conditional logic in scripts that compare CSV values to constants
Full-width / half-width inconsistency
In East Asian datasets, the same character may appear in full-width form (a, 1, !) or half-width form (a, 1, !). If your data mixes both — which is common when data is entered by different people — values that should match will silently fail to. Normalising to half-width for codes and IDs before import prevents these invisible mismatches.
Duplicate records
Duplicate rows or duplicate values in a key column (ID, email, product code) cause unique constraint violations on import and ambiguous results during merges. A related but trickier case: “duplicates” created by whitespace or invisible character differences — the values look the same but are not. See Duplicate data guide for a detailed treatment of detection and resolution.
Encoding problems
Garbled characters are not just a display problem — they break string comparisons, pattern matching, and any downstream processing that expects valid text. Always fix encoding before attempting any other cleaning step; cleaning garbled text is impossible and produces worse results. See Encoding issues.
Where dirty data comes from
- Copy-paste from browsers or PDFs — these sources inject zero-width spaces and non-breaking spaces into copied text without any visible indication
- CRM, ERP, or legacy system exports — these often pad fields with spaces, include control characters from internal data representations, or export in a regional encoding
- Manual data entry — accidental leading spaces, inconsistent capitalisation, and mixed full/half-width characters accumulate over time
- Spreadsheet formulas — CONCATENATE or similar functions can accidentally introduce extra spaces or change character encoding
- File merges or concatenation — joining files from different sources without normalisation carries each source’s inconsistencies into the combined file
Step-by-step cleaning workflow
- Fix encoding first — if text looks garbled, run Encoding fix to convert to UTF-8. Applying any other cleaning step to garbled text will produce incorrect results.
- Check format — run Format & basic check to confirm encoding, delimiter, column count, and whether there are empty lines or a duplicate header row. Fix structural issues before touching data.
- Detect data quality issues — run Single-file check. The report shows: invisible characters and which rows contain them, leading/trailing spaces, duplicate IDs, and full/half-width flags.
- Apply fixes — use “Fix all issues” to apply all detected fixes in one click, or handle each category separately. All fixes are applied in your browser; the file is never uploaded to a server.
- Download and verify — download the cleaned CSV and run the single-file check again to confirm zero issues before uploading to your system.
Before and after example
A user ID column where some values were copy-pasted from a web page:
user_id,name U001,Alice U002,Bob U003 ,Carol
Row 2 has a zero-width space after “U002”; row 3 has a trailing regular space after “U003”. Both are invisible on screen. After running single-file check and applying fixes:
user_id,name U001,Alice U002,Bob U003,Carol
All three IDs are now clean strings that will match correctly in any database lookup or JOIN operation.
When not to clean
Not every detected “issue” is actually an error in your data. Review before applying bulk fixes:
- Intentional spaces in values — some product names or addresses legitimately contain leading spaces as part of the value. Review flagged rows before trimming.
- Intentional duplicates — some tables allow multiple rows with the same key (e.g. order line items, transaction logs). Understand your data model before removing duplicates.
- Full-width characters in proper nouns — in Japanese or Korean data, full-width characters in names may be intentional and correct. Normalise only columns where half-width is expected (numeric codes, dates, IDs).
Open the tools
- Single-file check — invisible characters, trim, duplicate IDs
- Format & basic check — encoding, delimiter, column mismatch
- Encoding recovery — fix garbled text before cleaning
- Duplicate data guide — detailed duplicate detection and handling