How can I detect the encoding/codepage of a text file?

Introduction

You usually cannot detect a text file's encoding with certainty unless the file carries explicit evidence such as a BOM, or an external contract (a protocol header, a format specification) states the encoding. In practice, encoding detection is a mix of hard evidence, heuristics, and fallback policy.

Start With What Can Be Known Exactly

Some files declare their encoding directly through a byte order mark. Common BOM patterns include:

UTF-8: EF BB BF
UTF-16 LE: FF FE
UTF-16 BE: FE FF
UTF-32 LE: FF FE 00 00
UTF-32 BE: 00 00 FE FF

If a BOM is present, detection is easy.

python

def detect_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    # Check UTF-32 before UTF-16: FF FE is a prefix of FF FE 00 00.
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None

That is real detection, not guesswork.

Heuristics Begin Where Evidence Ends

If there is no BOM, the situation becomes probabilistic. Some encodings are easier to infer than others:

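strict UTF-8 is largely self-validating: multi-byte sequences follow rigid lead-byte and continuation-byte patterns that other encodings rarely produce by accident,
pure ASCII is byte-for-byte identical in UTF-8 and in most legacy single-byte code pages,
Windows-1252 and ISO-8859-1 are notoriously ambiguous because they differ only in the 0x80-0x9F range,
and short files are much harder to classify than long ones.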

This is why "detect the codepage" is not a question with a perfect universal answer. The bytes may simply be valid under multiple interpretations.
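To make the ambiguity concrete, here is a small sketch that decodes a few byte strings under three candidate encodings. The sample bytes and the `try_decode` helper are illustrative assumptions, not taken from any real file or library.

```python
# Illustrative byte strings (made up for this example).
samples = {
    "single-byte e-acute": b"caf\xe9",
    "same word encoded as UTF-8": b"caf\xc3\xa9",
    "bytes in the 0x80-0x9F range": b"\x93quoted\x94",
}


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a marker if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return "<invalid>"


for label, data in samples.items():
    print(label)
    for encoding in ("utf-8", "cp1252", "iso-8859-1"):
        print(f"  {encoding:<10} -> {try_decode(data, encoding)!r}")
```

Only strict UTF-8 ever rejects input here. The two single-byte encodings accept all three samples and disagree only on the last one, where Windows-1252 yields curly quotes and ISO-8859-1 yields control characters. Nothing in the bytes themselves says which reading the author intended.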

A Practical Detection Function in Python

One reasonable workflow is:

check for BOM,
try strict UTF-8,
fall back to a heuristic library,
and define a default fallback such as Windows-1252 or UTF-8 depending on your environment.

python

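from pathlib import Path

import chardet  # third-party: pip install chardet


def detect_encoding(path, default="windows-1252"):
    """Return a best-guess encoding name for the file at `path`."""
    raw = Path(path).read_bytes()

    # 1. Hard evidence: a byte order mark (detect_bom is defined above).
    bom = detect_bom(raw)
    if bom:
        return bom

    # 2. Strict UTF-8: non-ASCII text rarely decodes cleanly by accident.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Heuristic guess; chardet returns a dict whose "encoding" may be None.
    guess = chardet.detect(raw)

    # 4. Fall back to the default when the library has no answer.
    return guess["encoding"] or default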

This does not guarantee correctness, but it is a practical approach.

Why Ambiguity Is Unavoidable

Consider a file containing only English letters and punctuation. Those bytes are valid in:

ASCII,
UTF-8,
ISO-8859-1,
Windows-1252,

and several others.

No algorithm can infer the author's intended encoding from those bytes alone because the file content does not distinguish the possibilities.

That is the fundamental reason encoding detection can never be perfect in the general case.
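A minimal check of that claim, using an arbitrary ASCII-only byte string chosen for illustration:

```python
# ASCII-only bytes: every candidate encoding decodes them to the same text.
data = b"Hello, world! 1 + 1 = 2."
candidates = ["ascii", "utf-8", "iso-8859-1", "cp1252"]

decoded = {encoding: data.decode(encoding) for encoding in candidates}

# A single distinct result means the bytes carry no evidence about
# which of the candidates the author had in mind.
all_identical = len(set(decoded.values())) == 1
print(all_identical)  # True
```

Since every candidate produces identical text, a detector that reports any one of them is technically right, and none of the answers tells you what the next non-ASCII byte appended to the file will mean.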

Prefer Metadata Over Guessing

If you control the file format, the best solution is not better guessing. It is better metadata.

Examples:

always write UTF-8,
include a BOM where appropriate,
store encoding in a sidecar file or protocol header,
or document the encoding contract explicitly.
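As one possible shape for the sidecar option, the sketch below writes a text file together with a small JSON file that records its encoding, then reads the text back without guessing. The `.meta.json` suffix, the `"encoding"` key, and both function names are assumptions made for this example, not an established convention.

```python
import json
import tempfile
from pathlib import Path


def write_with_sidecar(path: Path, text: str, encoding: str = "utf-8") -> None:
    """Write text and record its encoding in a hypothetical '<name>.meta.json'."""
    path.write_text(text, encoding=encoding)
    sidecar = path.with_name(path.name + ".meta.json")
    sidecar.write_text(json.dumps({"encoding": encoding}), encoding="utf-8")


def read_with_sidecar(path: Path) -> str:
    """Read text using the encoding declared in the sidecar, with no guessing."""
    sidecar = path.with_name(path.name + ".meta.json")
    encoding = json.loads(sidecar.read_text(encoding="utf-8"))["encoding"]
    return path.read_text(encoding=encoding)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "notes.txt"
    # "\u00ef" and "\u00e9" are i-diaeresis and e-acute.
    write_with_sidecar(target, "na\u00efve caf\u00e9", encoding="utf-8")
    round_tripped = read_with_sidecar(target)
```

The point of the design is that the reader never inspects the bytes to decide how to decode them. The producer states the encoding once, and the consumer obeys it.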

Detection libraries are useful mainly because many real systems inherit files without reliable metadata.

Handling Detection Results Safely

Detection libraries usually return both a guess and a confidence score. Treat that score as guidance, not proof.

A practical rule is:

trust BOMs strongly,
trust strict UTF-8 decoding reasonably,
treat legacy single-byte guesses cautiously,
and allow manual override in user-facing tools.

This is especially important if the text may contain names, legal content, or any data where silent corruption is unacceptable.
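The rule above could be encoded as a small decision function. This is a sketch under stated assumptions: the function name, the 0.8 threshold, and the `cp1252` fallback are illustrative choices, and `guess` is assumed to be a dict with `encoding` and `confidence` keys like the ones detection libraries commonly return.

```python
from typing import Optional, Tuple


def choose_encoding(
    bom_encoding: Optional[str],
    is_valid_utf8: bool,
    guess: dict,
    override: Optional[str] = None,
    min_confidence: float = 0.8,  # assumed threshold; tune for your data
    default: str = "cp1252",      # assumed fallback; depends on environment
) -> Tuple[str, str]:
    """Return (encoding, reason) following the trust order described above."""
    if override:
        return override, "manual override"
    if bom_encoding:
        return bom_encoding, "BOM"
    if is_valid_utf8:
        return "utf-8", "strict UTF-8 decode"

    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding, f"library guess ({confidence:.2f})"

    # Low-confidence or missing guess: use the default and say so,
    # so a caller can flag the file for manual review.
    return default, "fallback"


print(choose_encoding(None, False, {"encoding": "Windows-1252", "confidence": 0.73}))
```

Returning the reason alongside the encoding lets a user-facing tool show why a choice was made and offer the override when the answer came from a guess or a fallback.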

Example With Confidence

python

from pathlib import Path
import chardet

path = Path("example.txt")
raw = path.read_bytes()

result = chardet.detect(raw)
print(result)

A result may look like:

python

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

That is a useful guess, not a guarantee.

Common Pitfalls

The biggest pitfall is expecting perfect detection for files with no BOM or metadata. That is impossible in the general case.

Another mistake is treating a library guess as certain, especially for legacy encodings that overlap heavily.

Developers also often ignore the environment. A text file created by an old Windows application may have a much more likely default encoding than one produced by a modern Unix toolchain.
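One way to take the environment into account is to choose the fallback from whatever is known about where the file came from. The sketch below is only an illustration: the hint strings and the function name are hypothetical, and the final branch uses the locale of the machine running the code, which may differ from the machine that wrote the file.

```python
import locale


def environment_default(source_hint: str = "") -> str:
    """Pick a fallback encoding from a hypothetical hint about the file's origin."""
    # Hypothetical hint values; adapt them to the provenance data you have.
    if source_hint == "legacy-windows-western":
        return "cp1252"
    if source_hint == "modern-unix":
        return "utf-8"
    # No hint: use the preferred encoding of the current machine's locale.
    return locale.getpreferredencoding(False)


print(environment_default("legacy-windows-western"))  # cp1252
```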

Finally, if you control both producer and consumer, stop relying on detection and enforce a fixed encoding policy instead.

Summary

BOMs give the strongest and most reliable encoding signal.
Without metadata, encoding detection becomes heuristic rather than exact.
UTF-8 can often be tested directly, but many legacy encodings remain ambiguous.
Detection libraries are useful, but their guesses are not guarantees.
If you control the format, standardize the encoding instead of trying to infer it later.