Decode HTML entities in Python string?

Introduction

If a Python string contains HTML entities such as &amp;, &lt;, or &#39;, the usual fix is html.unescape. It converts both named entities and numeric character references back into normal text.

The Built-In Solution: html.unescape

Python already includes the right tool in the standard library.

python
import html

# The input holds escaped entities, not the literal characters.
text = "Tom &amp; Jerry &lt;cartoon&gt; &#39;classic&#39;"
decoded = html.unescape(text)

print(decoded)
text
Tom & Jerry <cartoon> 'classic'

This is the correct default answer for ordinary HTML-escaped text. It handles cases such as:

  • named entities like &amp;
  • decimal numeric entities like &#39;
  • hexadecimal numeric entities like &#x27;

Because it is part of the standard library, you do not need an extra package for basic entity decoding.
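
As a quick check of the three forms listed above, the following minimal sketch decodes the same ampersand written as a named entity, a decimal reference, and a hexadecimal reference. The variable names are only illustrative.

```python
import html

# Three spellings of the same character: named, decimal, and hexadecimal.
samples = ["&amp;", "&#38;", "&#x26;"]

decoded = [html.unescape(sample) for sample in samples]
print(decoded)  # ['&', '&', '&']
```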

What HTML Entities Actually Are

HTML entities are escaped text representations used inside HTML documents so special characters do not get interpreted as markup.

Examples:

  • &lt; means <
  • &gt; means >
  • &amp; means &
  • &quot; means "
  • &#169; means © (the copyright symbol)

When you scrape pages, process exported HTML, or read data from a CMS, these escaped forms often appear in plain strings.
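
To see where such strings come from, it can help to run the opposite operation. The sketch below uses html.escape from the standard library on an illustrative input (the sample markup is an assumption, not taken from any real page) and then confirms that html.unescape restores the original.

```python
import html

# Illustrative input containing characters that are special in HTML.
raw = '<a href="x">Tom & Jerry</a>'

# html.escape replaces &, <, > and (by default) quotes with entities.
escaped = html.escape(raw)
print(escaped)  # &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;

# Decoding once restores the original string.
round_trip = html.unescape(escaped)
print(round_trip == raw)  # True
```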

A Typical Data-Cleaning Example

Here is a small example of decoding a list of strings before storing or displaying them.

python
import html

items = [
    "Alice &amp; Bob",
    "5 &lt; 10",
    "Use &#x27;quotes&#x27; here",
]

# Decode every string in the list.
cleaned = [html.unescape(item) for item in items]
print(cleaned)
text
['Alice & Bob', '5 < 10', "Use 'quotes' here"]

This pattern is common in ETL scripts, scrapers, and feed processors.
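
Real records usually mix strings with numbers or missing values. One possible variation, assuming rows arrive as dictionaries, decodes only the string fields and leaves everything else alone. The helper name decode_record and the sample rows are hypothetical.

```python
import html

def decode_record(record):
    """Return a copy of record with entities decoded in string values only."""
    return {
        key: html.unescape(value) if isinstance(value, str) else value
        for key, value in record.items()
    }

# Hypothetical rows, e.g. parsed from a feed or a CMS export.
rows = [
    {"id": 1, "title": "Salt &amp; Pepper", "price": 4.5},
    {"id": 2, "title": "5 &lt; 10", "price": None},
]

cleaned_rows = [decode_record(row) for row in rows]
print(cleaned_rows)
```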

Decoding Is Not the Same as Stripping HTML Tags

A frequent confusion is mixing entity decoding with HTML parsing.

This string:

python
text = "&lt;b&gt;Hello&lt;/b&gt;"

contains encoded markup. If you run html.unescape, you get:

python
import html

text = "&lt;b&gt;Hello&lt;/b&gt;"
print(html.unescape(text))
text
<b>Hello</b>

That result still contains HTML tags. You decoded the entities, but you did not remove markup. If your real goal is plain text from HTML, you need an HTML parser after decoding, not just unescape.
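
A minimal sketch of that second step, using only the standard library's html.parser.HTMLParser: the TextExtractor class and strip_tags helper are hypothetical names, and a production pipeline might prefer a dedicated parsing library. The sketch collects text nodes and drops the tags around them.

```python
import html
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; may arrive in several chunks.
        self.parts.append(data)

def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

encoded = "&lt;b&gt;Hello&lt;/b&gt; world"
markup = html.unescape(encoded)  # step 1: entities -> markup
plain = strip_tags(markup)       # step 2: markup -> plain text
print(plain)  # Hello world
```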

Handling Double-Escaped Data

Sometimes data has been escaped more than once. For example:

python
import html

text = "&amp;lt;div&amp;gt;Hello&amp;lt;/div&amp;gt;"
print(html.unescape(text))

The first unescape gives you &lt;div&gt;Hello&lt;/div&gt;, not the final markup. In that kind of pipeline, double-escaping is a data-quality issue, and you should fix it deliberately rather than unescaping repeatedly without understanding the source.

Repeatedly decoding until "it looks right" can corrupt legitimate text.
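
Here is a small illustration of that risk, under the assumption that the source text is a tutorial sentence that legitimately mentions an entity by name. The sample sentence is made up for the example.

```python
import html

# The author really meant to show the characters "&lt;" to the reader.
intended = "Use &lt; for less-than"

# Escaped exactly once for HTML transport.
stored = html.escape(intended)
print(stored)  # Use &amp;lt; for less-than

once = html.unescape(stored)
twice = html.unescape(once)

print(once)   # Use &lt; for less-than  (matches what the author wrote)
print(twice)  # Use < for less-than     (the entity name is lost)
```

Decoding once is correct here because the data was escaped once. The second pass silently changes the meaning of the sentence.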

When You Should Not Decode

If the string will later be inserted into HTML output, decoding it too early reintroduces markup-sensitive characters such as < and &. In other words, decode when you need readable text, not automatically at every layer.

This matters in web apps that ingest text from one source and later re-render it in HTML. Entity decoding and HTML escaping are opposite operations, and doing them at the wrong stage leads to bugs or even security problems.
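
The sketch below shows one way to keep the two stages apart, assuming a simple flow where text is decoded on ingestion and escaped again only at the point of rendering. The render_comment function and the sample payload are hypothetical.

```python
import html

# Text received from an upstream source, already entity-escaped.
incoming = "&lt;script&gt;alert(1)&lt;/script&gt;"

# Decode once to get the readable text for storage or processing.
stored = html.unescape(incoming)

def render_comment(text):
    # Escape at the output boundary, immediately before building HTML.
    return "<p>" + html.escape(text) + "</p>"

unsafe = "<p>" + stored + "</p>"  # raw markup would reach the page
safe = render_comment(stored)

print(unsafe)  # <p><script>alert(1)</script></p>
print(safe)    # <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```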

Common Pitfalls

The most common mistake is trying to decode HTML entities with manual string replacements instead of using html.unescape.
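
To make the problem concrete, here is a deliberately naive replacement chain compared with html.unescape. The naive_decode function is a hypothetical example of the anti-pattern, not a recommended approach.

```python
import html

def naive_decode(text):
    # Anti-pattern: a hand-written chain covers only a few entities,
    # and its order matters.
    return text.replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">")

# Case 1: an escaped ampersand followed by "lt;" gets decoded twice.
tricky = "&amp;lt;"
print(naive_decode(tricky))   # <
print(html.unescape(tricky))  # &lt;

# Case 2: numeric references are not handled at all.
numeric = "&#169; 2024"
print(naive_decode(numeric))   # &#169; 2024
print(html.unescape(numeric))  # © 2024
```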

Another mistake is thinking entity decoding removes HTML tags. It does not.

A third pitfall is repeatedly decoding the same string when the real problem is double-escaped or malformed source data.

Summary

  • Use html.unescape to decode HTML entities in Python.
  • It handles named entities and numeric character references.
  • Decoding entities is different from stripping HTML tags.
  • Be careful with double-escaped data and avoid repeated blind decoding.
  • Decode when you need plain readable text, not indiscriminately at every layer.
