Decode HTML entities in Python string?
Introduction
If a Python string contains HTML entities such as &amp;, &lt;, or &#39;, the usual fix is html.unescape. It converts both named entities and numeric character references back into normal text.
The Built-In Solution: html.unescape
Python already includes the right tool in the standard library.
This is the correct default answer for ordinary HTML-escaped text. It handles cases such as:
- named entities like &amp;
- decimal numeric entities like &#39;
- hexadecimal numeric entities like &#x27;
Because it is part of the standard library, you do not need an extra package for basic entity decoding.
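As a minimal sketch of the three cases listed above, the snippet below runs html.unescape on a named entity, a decimal reference, and a hexadecimal reference. The sample strings and variable names are made up for illustration.

```python
import html

# Named entity: &amp; becomes a literal ampersand.
named = html.unescape("Fish &amp; chips")

# Decimal numeric reference: &#39; is the apostrophe (code point 39).
decimal = html.unescape("It&#39;s fine")

# Hexadecimal numeric reference: &#x27; is the same apostrophe (0x27 == 39).
hexadecimal = html.unescape("It&#x27;s fine")

print(named)        # Fish & chips
print(decimal)      # It's fine
print(hexadecimal)  # It's fine
```

All three forms go through the same call, so there is no need to detect which style of entity a string uses before decoding it.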
What HTML Entities Actually Are
HTML entities are escaped text representations used inside HTML documents so special characters do not get interpreted as markup.
Examples:
- &lt; means <
- &gt; means >
- &amp; means &
- &quot; means "
- &copy; means the copyright symbol
When you scrape pages, process exported HTML, or read data from a CMS, these escaped forms often appear in plain strings.
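To make the mapping between entity names and characters concrete, the sketch below looks up the names from the list above in html.entities.name2codepoint, a standard-library table of named entities. The loop and the `decoded` dictionary are illustrative only.

```python
from html.entities import name2codepoint

# Entity names taken from the examples above (without the & and ; delimiters).
names = ("lt", "gt", "amp", "quot", "copy")

# Map each entity name to the character it stands for.
decoded = {name: chr(name2codepoint[name]) for name in names}

for name, char in decoded.items():
    print(f"&{name}; -> {char!r} (U+{ord(char):04X})")
```

In everyday code you would rarely need this table directly, since html.unescape consults the entity definitions for you.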
A Typical Data-Cleaning Example
Here is a small example of decoding a list of strings before storing or displaying them.
This pattern is common in ETL scripts, scrapers, and feed processors.
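A minimal sketch of that cleanup step might look like the following. The `raw_titles` data and the `clean_titles` helper are hypothetical names chosen for this example, not part of any library.

```python
import html

# Hypothetical scraped values; real input would come from a feed, CMS export, etc.
raw_titles = [
    "Tom &amp; Jerry",
    "5 &lt; 10",
    "It&#39;s here",
    "R&D report",  # already plain text, should pass through unchanged
]


def clean_titles(values):
    """Return a new list with HTML entities decoded in every string."""
    return [html.unescape(value) for value in values]


cleaned = clean_titles(raw_titles)
print(cleaned)
```

Building a new list rather than mutating the input keeps the raw values available, which helps when you later need to check what the source actually sent.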
Decoding Is Not the Same as Stripping HTML Tags
A frequent confusion is mixing entity decoding with HTML parsing.
A string such as &lt;p&gt;Hello&lt;/p&gt; contains encoded markup. If you run html.unescape on it, you get <p>Hello</p>.
That result still contains HTML tags. You decoded the entities, but you did not remove markup. If your real goal is plain text from HTML, you need an HTML parser after decoding, not just unescape.
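One way to sketch the two-step process is to decode first and then feed the result to the standard library's html.parser.HTMLParser, collecting only the text nodes. The `TextExtractor` class and `to_plain_text` function are hypothetical helpers written for this example. A dedicated parsing library may be a better fit for messy real-world HTML.

```python
import html
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def to_plain_text(escaped):
    """Decode entities, then drop the markup that decoding revealed."""
    markup = html.unescape(escaped)
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


encoded = "&lt;p&gt;Hello, &lt;b&gt;world&lt;/b&gt;!&lt;/p&gt;"
print(html.unescape(encoded))  # still contains <p> and <b> tags
print(to_plain_text(encoded))  # text only
```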
Handling Double-Escaped Data
Sometimes data has been escaped more than once. For example, take the string &amp;lt;div&amp;gt;Hello&amp;lt;/div&amp;gt;.
The first unescape gives you &lt;div&gt;Hello&lt;/div&gt;, not the final markup. In that kind of pipeline, double-escaping is a data-quality issue, and you should fix it deliberately rather than unescaping repeatedly without understanding the source.
Repeatedly decoding until "it looks right" can corrupt legitimate text.
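The sketch below illustrates both halves of that warning. It decodes a double-escaped string in two explicit passes, then shows how a naive "decode until nothing changes" loop damages text whose single level of escaping was intentional. The `unescape_until_stable` function is a hypothetical anti-example, not a recommended helper.

```python
import html

# Double-escaped input: two explicit passes are needed to reach the markup.
double_escaped = "&amp;lt;div&amp;gt;Hello&amp;lt;/div&amp;gt;"
once = html.unescape(double_escaped)
twice = html.unescape(once)


def unescape_until_stable(text):
    """Anti-example: keeps decoding until the string stops changing."""
    previous = None
    while text != previous:
        previous, text = text, html.unescape(text)
    return text


# Legitimate text: a tutorial sentence that is supposed to display "&amp;".
legit = "Type &amp;amp; to get an ampersand"
correct = html.unescape(legit)            # one pass, as intended
corrupted = unescape_until_stable(legit)  # decodes one level too far

print(once)
print(twice)
print(correct)
print(corrupted)
```

If a source is known to double-escape, applying exactly two passes to that source alone is easier to reason about than a loop applied to everything.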
When You Should Not Decode
If the string is still going to be embedded safely into HTML output, decoding it too early can reintroduce markup-sensitive characters. In other words, decode when you need readable text, not automatically at every layer.
This matters in web apps that ingest text from one source and later re-render it in HTML. Entity decoding and HTML escaping are opposite operations, and doing them at the wrong stage leads to bugs or even security problems.
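As a rough illustration of the ordering problem, the sketch below decodes a stored value for processing and then builds an HTML fragment from it twice: once without re-escaping and once with html.escape applied at output time. The stored string and the fragment variables are invented for the example, and a real application would normally rely on its template engine's escaping.

```python
import html

# Hypothetical value as it arrived from an upstream source, already escaped.
stored = "Tom &amp; Jerry &lt;script&gt;alert(1)&lt;/script&gt;"

# Decoded form, suitable for search indexes, logs, or plain-text display.
readable = html.unescape(stored)

# Embedding the decoded text directly reintroduces live markup.
unsafe_fragment = "<p>" + readable + "</p>"

# Escaping again at the point of output keeps the text inert.
safe_fragment = "<p>" + html.escape(readable) + "</p>"

print(unsafe_fragment)
print(safe_fragment)
```

Here html.escape(readable) reproduces the stored value exactly, which shows the two functions acting as inverses for this input.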
Common Pitfalls
The most common mistake is trying to decode HTML entities with manual string replacements instead of using html.unescape.
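To see why hand-written replacements fall short, the sketch below compares a hypothetical `naive_unescape` chain with html.unescape on two made-up inputs. The naive version is an anti-example: it decodes one string one level too far because the replacements run in sequence, and it leaves numeric and less common named entities untouched.

```python
import html


def naive_unescape(text):
    """Anti-example: a hand-rolled decoder built from chained replacements."""
    return (
        text.replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", '"')
    )


# Escaped form of the literal text "&lt;" (for example in an HTML tutorial).
ordering_sample = "&amp;lt;"

# Numeric reference and a named entity outside the hand-written list.
coverage_sample = "It&#39;s &copy; 2024"

print(naive_unescape(ordering_sample))  # decoded one level too far
print(html.unescape(ordering_sample))   # decoded exactly once
print(naive_unescape(coverage_sample))  # left unchanged
print(html.unescape(coverage_sample))   # fully decoded
```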
Another mistake is thinking entity decoding removes HTML tags. It does not.
A third pitfall is repeatedly decoding the same string when the real problem is double-escaped or malformed source data.
Summary
- Use html.unescape to decode HTML entities in Python.
- It handles named entities and numeric character references.
- Decoding entities is different from stripping HTML tags.
- Be careful with double-escaped data and avoid repeated blind decoding.
- Decode when you need plain readable text, not indiscriminately at every layer.

