Replace non-ASCII characters with a single space

Introduction

Replacing non-ASCII characters with a single space is a common text-cleaning step when a downstream system only accepts basic ASCII. The important detail is that you usually want to replace runs of non-ASCII characters with one space, not emit one space per character and accidentally create long stretches of whitespace.

Match non-ASCII runs, not individual characters

In Python, a simple regular expression can replace any consecutive non-ASCII sequence with one space:

python

import re

text = "cafe déjà vu 東京"
# Replace each run of non-ASCII characters with a single space.
cleaned = re.sub(r"[^\x00-\x7F]+", " ", text)
print(cleaned)

The key pattern is [^\x00-\x7F]+:

[^\x00-\x7F] means "any character outside the ASCII range"
+ means "one or more in a row"

That + matters because it collapses a whole run into one replacement.
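To make the effect of the quantifier concrete, here is a minimal comparison of the same pattern with and without +. The sample string and the variable names are illustrative assumptions, not part of any particular codebase.

```python
import re

# Illustrative input: two adjacent non-ASCII characters between ASCII words.
text = "Tokyo東京Tower"

# Without "+": every non-ASCII character is replaced by its own space.
per_char = re.sub(r"[^\x00-\x7F]", " ", text)

# With "+": the whole run of non-ASCII characters becomes one space.
per_run = re.sub(r"[^\x00-\x7F]+", " ", text)

print(repr(per_char))  # 'Tokyo  Tower' (two spaces)
print(repr(per_run))   # 'Tokyo Tower'  (one space)
```

The longer the non-ASCII run, the bigger the difference: a ten-character run yields ten spaces without the quantifier and one space with it.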

Normalize whitespace after replacement

Once non-ASCII sequences are replaced, you often end up with doubled spaces or leading and trailing whitespace. A second normalization step usually makes the output much cleaner:

python

import re

def replace_non_ascii_with_single_space(text: str) -> str:
    # Collapse each run of non-ASCII characters into one space.
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    # Collapse every whitespace run into one space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text


print(replace_non_ascii_with_single_space("cafe déjà vu 東京"))

This produces a string with stable spacing instead of accidental space clutter.
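One side effect worth knowing: \s+ also matches tabs and newlines, so the function above flattens multi-line input onto a single line. If line structure matters, one possible variant is to clean each line separately. The sketch below assumes that behavior is wanted; the function name replace_non_ascii_keep_lines is hypothetical.

```python
import re

def replace_non_ascii_keep_lines(text: str) -> str:
    """Hypothetical variant: clean line by line so line breaks survive."""
    cleaned_lines = []
    for line in text.splitlines():
        # Collapse each run of non-ASCII characters into one space.
        line = re.sub(r"[^\x00-\x7F]+", " ", line)
        # Collapse only spaces and tabs, then trim the ends of the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


print(repr(replace_non_ascii_keep_lines("price:  5€\n東京 Tower")))
# 'price: 5\nTower'
```

Note that splitlines() drops a trailing newline, so add one back if the downstream consumer expects it.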

Be clear about what you are throwing away

Replacing non-ASCII with spaces is a destructive transformation. It removes accented letters, non-Latin scripts, emoji, and many symbols. That may be acceptable for legacy systems or crude tokenization, but it is not appropriate when the original meaning matters.

For example:

"Málaga" becomes "M laga" if you replace non-ASCII directly
"東京" disappears entirely, leaving only a space
"naïve" becomes "na ve" because the letter carrying the diacritic is removed
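These losses are easy to confirm. The following sketch runs the replacement on the three words above; the helper name spaces_for_non_ascii is hypothetical, and the NFC normalization step is an added assumption so that each accented letter is a single code point regardless of how the literal was typed.

```python
import re
import unicodedata

def spaces_for_non_ascii(text: str) -> str:
    """Hypothetical helper: replace non-ASCII runs with one space."""
    # Assumption: compose to NFC first so "á" is one code point, not "a" + accent.
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"[^\x00-\x7F]+", " ", text)


for word in ["Málaga", "東京", "naïve"]:
    print(repr(word), "->", repr(spaces_for_non_ascii(word)))
```

No whitespace normalization is applied here, so the output shows exactly what the replacement alone does to each word.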

If you want "Málaga" to become "Malaga" instead, you need transliteration, which is a different problem from space replacement.

Transliteration is a different strategy

Sometimes developers ask for space replacement when what they really want is ASCII approximation. In Python, Unicode normalization can help for accented Latin text:

python

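import unicodedata

def transliterate_to_ascii(text: str) -> str:
    # NFKD splits accented letters into a base letter plus combining marks.
    normalized = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" then drops every remaining non-ASCII code point.
    return normalized.encode("ascii", "ignore").decode("ascii")


print(transliterate_to_ascii("déjà vu"))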

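That works for characters that decompose into an ASCII base letter plus combining marks, such as "é" or "ñ". Characters with no such decomposition, including "ø", "ß", and most non-Latin scripts, are silently dropped rather than approximated. Either way, it is not the same as replacing non-ASCII with a space.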
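To see where the normalization approach stops helping, the sketch below runs the same NFKD-then-encode function on a few illustrative inputs. The sample words are assumptions chosen to show both outcomes.

```python
import unicodedata

def transliterate_to_ascii(text: str) -> str:
    # Decompose, then drop whatever is still outside ASCII.
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


for sample in ["Málaga", "Ørsted", "Straße", "東京"]:
    print(repr(sample), "->", repr(transliterate_to_ascii(sample)))
# 'Málaga' -> 'Malaga'   (accent stripped, base letter kept)
# 'Ørsted' -> 'rsted'    ("Ø" has no decomposition, so it is dropped)
# 'Straße' -> 'Strae'    ("ß" has no decomposition either)
# '東京'   -> ''         (nothing survives)
```

Dropped characters leave no space behind, so neighboring text fuses together. That is a different failure mode from space replacement and worth checking for separately.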

Pick the rule that matches the downstream use case

A good rule of thumb is:

replace with spaces when you are separating tokens and discarding unsupported characters
transliterate when you want ASCII-like readable text
keep Unicode intact when the system can support it
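One way to keep that choice explicit in code is to make the strategy a named parameter instead of burying a regex in a helper. The following is only a sketch: the function clean_text and its mode values are hypothetical names invented for illustration.

```python
import re
import unicodedata

def clean_text(text: str, mode: str) -> str:
    """Hypothetical dispatcher over the three strategies listed above."""
    if mode == "space":
        # Separate tokens and discard unsupported characters.
        text = re.sub(r"[^\x00-\x7F]+", " ", text)
        return re.sub(r"\s+", " ", text).strip()
    if mode == "transliterate":
        # ASCII approximation for text that decomposes under NFKD.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")
    if mode == "keep":
        # The downstream system supports Unicode, so leave the text alone.
        return text
    raise ValueError(f"unknown mode: {mode!r}")


print(clean_text("Málaga café", "transliterate"))
```

Forcing callers to pass a mode makes the destructive option a visible decision at each call site rather than a default.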

Many text-cleaning bugs happen because the code uses a harsh replacement rule for data that really should remain Unicode.

Test with representative multilingual input

Do not test only with simple accented Latin examples. Include:

emoji
punctuation from other scripts
combining characters (a base letter followed by a separate combining accent)
non-Latin words

That quickly reveals whether your chosen regex rule matches the real cleaning requirement.
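A small check along these lines might look like the sketch below. The sample strings are illustrative assumptions, not taken from a real dataset, and the cleaning function is the two-step version shown earlier.

```python
import re

def replace_non_ascii_with_single_space(text: str) -> str:
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Illustrative samples, one per category in the list above.
samples = {
    "emoji": "ship it 🚀 now",
    "punctuation from other scripts": "「quote」 and ¿why?",
    "combining characters": "cafe\u0301",  # "e" followed by U+0301
    "non-Latin words": "東京 Tower",
}

for label, sample in samples.items():
    cleaned = replace_non_ascii_with_single_space(sample)
    # The result must be pure ASCII for every category.
    assert cleaned.isascii()
    print(f"{label}: {sample!r} -> {cleaned!r}")
```

The combining-character case is the instructive one: the decomposed form keeps the base letter and yields "cafe", while a precomposed "é" would be removed whole and yield "caf". If that inconsistency matters, normalize the input to one form before cleaning.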

Common Pitfalls

Replacing each non-ASCII character individually and creating multiple spaces in a row.
Forgetting to normalize whitespace after the replacement.
Using space replacement when transliteration would better preserve meaning.
Assuming accented Latin letters are the only non-ASCII characters in real data.
Applying destructive ASCII cleanup to text that should remain Unicode.

Summary

Use a regex like [^\x00-\x7F]+ to replace runs of non-ASCII characters with one space.
Normalize whitespace afterward so the result does not contain repeated spaces.
Space replacement is destructive and should be used only when that loss is acceptable.
Transliteration is a separate strategy for preserving approximate readable text.
Test with real multilingual input before standardizing the cleaning rule.