Replace non-ASCII characters with a single space

Introduction

Replacing non-ASCII characters with a single space is a common text-cleaning step when a downstream system only accepts basic ASCII. The important detail is that you usually want to replace runs of non-ASCII characters with one space, not emit one space per character and accidentally create long stretches of whitespace.

Match non-ASCII runs, not individual characters

In Python, a simple regular expression can replace any consecutive non-ASCII sequence with one space:

python
import re

text = "cafe déjà vu 東京"
# Replace each run of consecutive non-ASCII characters with one space.
cleaned = re.sub(r"[^\x00-\x7F]+", " ", text)
print(cleaned)

The key pattern is [^\x00-\x7F]+:

  • [^\x00-\x7F] means "any character outside the ASCII range"
  • + means "one or more in a row"

That + matters because it collapses a whole run into one replacement.
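To see what the quantifier changes, the short sketch below runs the same input through both patterns. The sample string and variable names are only illustrative.

```python
import re

text = "ab東京都cd"

# Without "+", each non-ASCII character becomes its own space.
per_char = re.sub(r"[^\x00-\x7F]", " ", text)

# With "+", the whole run "東京都" becomes a single space.
per_run = re.sub(r"[^\x00-\x7F]+", " ", text)

print(repr(per_char))  # 'ab   cd'
print(repr(per_run))   # 'ab cd'
```

Note that the run-based pattern only merges adjacent non-ASCII characters. A space that already sits next to the run is still kept, which is why a whitespace pass is usually needed afterward.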

Normalize whitespace after replacement

Once non-ASCII sequences are replaced, you often end up with doubled spaces or leading and trailing whitespace. A second normalization step usually makes the output much cleaner:

python
import re

def replace_non_ascii_with_single_space(text: str) -> str:
    # Step 1: turn each run of non-ASCII characters into one space.
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    # Step 2: collapse all whitespace runs and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text


print(replace_non_ascii_with_single_space("cafe déjà vu 東京"))

For the input above, the result is "cafe d j vu": single spaces between tokens and no leading or trailing whitespace.
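One thing to keep in mind is that \s+ also matches tabs and newlines, so the function above flattens multi-line text into a single line. If line breaks need to survive, one possible variant is sketched below. The name replace_non_ascii_keep_lines is hypothetical, and the sketch assumes lines are separated by "\n".

```python
import re

def replace_non_ascii_keep_lines(text: str) -> str:
    # Hypothetical variant: clean each line separately so newlines survive.
    cleaned_lines = []
    for line in text.split("\n"):
        line = re.sub(r"[^\x00-\x7F]+", " ", line)
        # Collapse runs of spaces and tabs only, then trim the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


print(replace_non_ascii_keep_lines("café  au lait\n東京 tower"))
```

This prints "caf au lait" and "tower" on separate lines, whereas the single-pass version would join them with a space.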

Be clear about what you are throwing away

Replacing non-ASCII with spaces is a destructive transformation. It removes accented letters, non-Latin scripts, emoji, and many symbols. That may be acceptable for legacy systems or crude tokenization, but it is not appropriate when the original meaning matters.

For example:

  • "Málaga" becomes "M laga" if you replace non-ASCII directly
  • "東京" disappears entirely, leaving only a space
  • "naïve" becomes "na ve", losing the diacritic-bearing character
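These losses can be reproduced directly. The sketch below applies the run-based pattern to each word and uses repr() so the inserted spaces are visible. The constant name NON_ASCII_RUN is just a label for this example, and the strings are assumed to be in precomposed (NFC) form.

```python
import re

NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")

for word in ["Málaga", "東京", "naïve"]:
    # repr() makes the inserted spaces visible in the output.
    print(repr(word), "->", repr(NON_ASCII_RUN.sub(" ", word)))
```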

If you want "Málaga" to become "Malaga" instead, you need transliteration, which is a different problem from space replacement.

Transliteration is a different strategy

Sometimes developers ask for space replacement when what they really want is ASCII approximation. In Python, Unicode normalization can help for accented Latin text:

python
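import unicodedata

def transliterate_to_ascii(text: str) -> str:
    # NFKD splits accented letters into a base letter plus combining marks.
    normalized = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" drops everything that is still non-ASCII.
    return normalized.encode("ascii", "ignore").decode("ascii")


print(transliterate_to_ascii("déjà vu"))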

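This prints "deja vu". The approach works for accented Latin letters, but characters with no ASCII decomposition (such as "ß", "ø", or CJK text) are silently dropped rather than approximated, and in no case is it the same as replacing non-ASCII with a space.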
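The limits of the normalization approach are easy to check. The sketch below repeats the function so it runs on its own and feeds it a few inputs chosen for illustration.

```python
import unicodedata

def transliterate_to_ascii(text: str) -> str:
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


print(transliterate_to_ascii("Málaga"))      # Malaga
print(transliterate_to_ascii("Straße"))      # Strae  ("ß" has no decomposition)
print(transliterate_to_ascii("Søren"))       # Sren   ("ø" has no decomposition)
print(repr(transliterate_to_ascii("東京")))  # ''
```

Nothing is inserted where a character is dropped, so neighboring letters are joined together. That is the opposite of what space replacement does.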

Pick the rule that matches the downstream use case

A good rule of thumb is:

  • replace with spaces when you are separating tokens and discarding unsupported characters
  • transliterate when you want ASCII-like readable text
  • keep Unicode intact when the system can support it

Many text-cleaning bugs happen because the code uses a harsh replacement rule for data that really should remain Unicode.
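One way to make the choice explicit in code is to route all cleaning through a single function that names the strategy. This is only a sketch: the function clean_text, its strategy parameter, and the three strategy names are hypothetical, not part of any library.

```python
import re
import unicodedata

def clean_text(text: str, strategy: str) -> str:
    # "strategy" and its accepted values are hypothetical names for this sketch.
    if strategy == "space":
        # Discard unsupported characters and keep tokens separated.
        text = re.sub(r"[^\x00-\x7F]+", " ", text)
        return re.sub(r"\s+", " ", text).strip()
    if strategy == "transliterate":
        # Approximate accented Latin letters with their base letters.
        normalized = unicodedata.normalize("NFKD", text)
        return normalized.encode("ascii", "ignore").decode("ascii")
    if strategy == "keep":
        # The downstream system supports Unicode, so change nothing.
        return text
    raise ValueError(f"unknown strategy: {strategy!r}")


for strategy in ("space", "transliterate", "keep"):
    print(strategy, "->", repr(clean_text("Málaga café", strategy)))
```

Raising on an unknown strategy means a typo cannot silently fall through to the destructive rule.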

Test with representative multilingual input

Do not test only with simple accented Latin examples. Include:

  • emoji
  • punctuation from other scripts
  • combining characters (a base letter followed by a combining mark)
  • non-Latin words

That quickly reveals whether your chosen regex rule matches the real cleaning requirement.
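A small table of samples is usually enough to start with. The sketch below repeats the space-replacement function so it runs on its own. The sample strings and labels are illustrative and should be replaced with data from the real pipeline.

```python
import re

def replace_non_ascii_with_single_space(text: str) -> str:
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


samples = {
    "emoji": "ship it 🚀 now",
    "other-script punctuation": "你好。世界",
    "combining character": "cafe\u0301",  # "e" followed by U+0301
    "non-Latin word": "привет world",
}

for label, sample in samples.items():
    print(f"{label}: {replace_non_ascii_with_single_space(sample)!r}")
```

The combining-character case is worth a close look. The decomposed form "cafe" + U+0301 comes out as "cafe", while the precomposed "café" comes out as "caf", so the same visible text can produce different results depending on its normalization form.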

Common Pitfalls

  • Replacing each non-ASCII character individually and creating multiple spaces in a row.
  • Forgetting to normalize whitespace after the replacement.
  • Using space replacement when transliteration would better preserve meaning.
  • Assuming accented Latin letters are the only non-ASCII characters in real data.
  • Applying destructive ASCII cleanup to text that should remain Unicode.

Summary

  • Use a regex like [^\x00-\x7F]+ to replace runs of non-ASCII characters with one space.
  • Normalize whitespace afterward so the result does not contain repeated spaces.
  • Space replacement is destructive and should be used only when that loss is acceptable.
  • Transliteration is a separate strategy for preserving approximate readable text.
  • Test with real multilingual input before standardizing the cleaning rule.
