How to parse unstructured product titles into structured data?

Introduction

Parsing product titles into structured data means turning a messy free-text title into separate fields, such as brand, model, color, size, and quantity, that software can reason about. The hard part is not splitting on spaces. It is handling ambiguity, inconsistent abbreviations, missing attributes, and category-specific vocabulary.

Start with a Clear Target Schema

Before writing any parsing code, define what structured output you actually want. For example:

brand,
product type,
size,
color,
quantity,
model or series,
gender or audience where relevant.

Without a target schema, parsing quality is impossible to measure because you do not know what the parser is supposed to extract.
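One way to pin the schema down is to write it as a small typed record. The sketch below is only an illustration: the class name `ParsedTitle` and the exact field names are assumptions based on the list above, and every field is optional because real titles often omit attributes.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ParsedTitle:
    """Hypothetical target schema; all fields default to None (attribute not found)."""
    brand: Optional[str] = None
    product_type: Optional[str] = None
    size: Optional[str] = None
    color: Optional[str] = None
    quantity: Optional[int] = None
    model: Optional[str] = None
    audience: Optional[str] = None


# A parser fills in whatever it can find and leaves the rest as None.
example = ParsedTitle(brand="acme", product_type="t-shirt", size="m", color="black", quantity=2)
print(asdict(example))
```

Having a fixed record like this also gives evaluation something concrete to compare against, field by field.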

Normalize the Raw Title First

A good first step is basic normalization:

lowercase or standardized casing,
punctuation cleanup,
unit normalization such as oz, ml, and pack,
synonym normalization such as blk to black.

python

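import re


def normalize(title: str) -> str:
    title = title.lower().strip()
    # Expand the abbreviation only when it is a whole word,
    # so "blk" inside a longer token is left untouched.
    title = re.sub(r"\bblk\b", "black", title)
    # Collapse runs of whitespace into single spaces.
    title = re.sub(r"\s+", " ", title)
    return title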

This does not solve the whole problem, but it reduces the number of token variants the parser must handle.
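The other steps in the list (punctuation cleanup, unit normalization, and a broader synonym table) could be sketched along the following lines. The mapping tables and the function name `normalize_tokens` are made up for illustration; a real catalog would need its own, much larger tables.

```python
import re

# Illustrative tables only; these entries are assumptions, not a standard vocabulary.
SYNONYMS = {"blk": "black", "wht": "white", "pk": "pack"}
UNITS = {"ounce": "oz", "ounces": "oz", "milliliter": "ml", "milliliters": "ml", "packs": "pack"}


def normalize_tokens(title: str) -> list[str]:
    text = title.lower()
    # Split a number fused to a known unit: "12oz" -> "12 oz".
    text = re.sub(r"(\d)(oz|ml|pk)\b", r"\1 \2", text)
    # Replace punctuation with spaces, but keep decimal points inside numbers.
    text = re.sub(r"[^\w\s.]|\.(?!\d)", " ", text)
    # Map each token to its canonical unit or synonym when one is known.
    return [UNITS.get(token, SYNONYMS.get(token, token)) for token in text.split()]


print(normalize_tokens("Acme Water Bottle, BLK, 12oz (2 Pk)"))
```

Working on whole tokens rather than substrings keeps an abbreviation such as `pk` from being rewritten inside unrelated words.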

Rule-Based Extraction Often Comes First

For many catalogs, a rule-based parser plus dictionaries works surprisingly well. You can look for known colors, sizes, units, and brands.

python

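# Small example dictionaries; real catalogs need far larger ones.
COLORS = {"red", "blue", "black", "white"}
SIZES = {"xs", "s", "m", "l", "xl"}


def parse_title(title: str) -> dict:
    # Relies on normalize() from the previous example.
    tokens = normalize(title).split()
    result = {
        "color": None,
        "size": None,
        "brand": None,  # stays None here; filling it needs a brand dictionary
    }

    # If several tokens match the same field, the last one wins.
    for token in tokens:
        if token in COLORS:
            result["color"] = token
        elif token in SIZES:
            result["size"] = token

    return result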

This is usually the right starting point because it is transparent and easy to debug.
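Some attributes, such as pack quantity or a measured size, are patterns rather than dictionary entries, so a regular expression is a natural rule for them. The patterns and function names below are hypothetical and cover only a few phrasings; they are meant as a sketch of the idea, not a complete grammar.

```python
import re

# Hypothetical patterns; real catalogs need more units and phrasing variants.
QUANTITY_RE = re.compile(r"\b(?:pack of (\d+)|(\d+)\s*(?:-\s*)?(?:pack|pk|count|ct)\b)")
MEASURE_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(oz|ml|l|g|kg)\b")


def extract_quantity(title: str):
    """Return the pack quantity as an int, or None if no pattern matches."""
    match = QUANTITY_RE.search(title.lower())
    if match is None:
        return None
    # Exactly one of the two capture groups is set, depending on the phrasing.
    return int(match.group(1) or match.group(2))


def extract_measure(title: str):
    """Return (amount, unit) for the first measurement found, or None."""
    match = MEASURE_RE.search(title.lower())
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


print(extract_quantity("Sparkling Water 12 oz Pack of 6"))
print(extract_measure("Sparkling Water 12 oz Pack of 6"))
```

Keeping each attribute in its own small function makes it easy to see which rule produced a wrong value.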

Use Category Context to Reduce Ambiguity

The same token can mean different things in different categories. The token M may mean medium in clothing, meters in hardware, or part of a model code in electronics. That is why product-title parsing works better when category information is available.

A parser for apparel should not be identical to a parser for power tools. Category-aware rules reduce false positives because each token is matched only against the vocabulary that applies to its category.
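A minimal sketch of category-aware lookup, assuming the category label is already supplied by the catalog. The table `CATEGORY_RULES` and the function `interpret_token` are invented names for this example.

```python
# Hypothetical per-category vocabularies; the category labels are assumed inputs.
CATEGORY_RULES = {
    "apparel": {"size": {"xs", "s", "m", "l", "xl"}},
    "hardware": {"unit": {"m", "cm", "mm"}},
}


def interpret_token(token: str, category: str):
    """Return (field, value) for a token under the given category, or None if no rule applies."""
    token = token.lower()
    for field, vocabulary in CATEGORY_RULES.get(category, {}).items():
        if token in vocabulary:
            return field, token
    return None


print(interpret_token("M", "apparel"))      # interpreted as a size
print(interpret_token("M", "hardware"))     # interpreted as a unit
print(interpret_token("M", "electronics"))  # no rule for this category
```

Returning `None` for an unknown category is a deliberate choice here: it is usually safer to leave a field empty than to guess with the wrong vocabulary.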

Move to Statistical or ML Models When Rules Plateau

Once rule-based extraction reaches its limit, a machine-learning or sequence-labeling model can help identify attributes in more varied text. But even then, rule-based normalization and dictionaries usually remain part of the pipeline.

In practice, the strongest systems are often hybrid:

normalization and dictionary cleanup first,
rule extraction for easy attributes,
ML for ambiguous or context-dependent fields,
confidence scoring for uncertain parses.
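The hybrid flow above can be sketched for a single field as follows. Everything here is an assumption for illustration: `stub_model` merely stands in for a trained model, the `0.8` threshold is arbitrary, and treating a dictionary hit as confidence `1.0` is a simplification.

```python
from typing import Callable, Dict, Optional, Tuple

COLORS = {"red", "blue", "black", "white"}

# A fallback takes a title and returns (value, confidence), or None if it has no guess.
Fallback = Callable[[str], Optional[Tuple[str, float]]]


def extract_color(title: str, fallback: Fallback, threshold: float = 0.8) -> Dict[str, object]:
    # Stage 1: dictionary rule, treated as fully confident.
    for token in title.lower().split():
        if token in COLORS:
            return {"value": token, "confidence": 1.0, "source": "rule", "needs_review": False}
    # Stage 2: defer to the model-like fallback when no rule fires.
    guess = fallback(title)
    if guess is None:
        return {"value": None, "confidence": 0.0, "source": "none", "needs_review": True}
    value, confidence = guess
    # Low-confidence guesses are kept but flagged for review.
    return {"value": value, "confidence": confidence, "source": "model",
            "needs_review": confidence < threshold}


def stub_model(title: str) -> Optional[Tuple[str, float]]:
    # Stand-in for a trained sequence labeler; it only recognizes one word.
    return ("navy", 0.6) if "navy" in title.lower() else None


print(extract_color("Acme Tee Black M", stub_model))
print(extract_color("Acme Tee Navy M", stub_model))
```

Recording the source and confidence alongside each value lets downstream code decide whether to accept a parse, queue it for review, or drop it.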

Evaluate on Real Titles, Not Invented Examples

Product title parsing quality depends heavily on the actual catalog language. Build a labeled sample of real titles and measure extraction accuracy per field. That quickly shows whether the parser fails mostly on brands, units, sizes, or ambiguous descriptors.
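Per-field accuracy can be computed with a few lines of code once predictions and labels are stored as dictionaries. This is a sketch under that assumption; the two rows below are placeholders that only show the data shape and should be replaced with a labeled sample of real titles.

```python
from collections import defaultdict


def per_field_accuracy(predictions, labels):
    """Fraction of titles where each field's predicted value equals the labeled value."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, expected in zip(predictions, labels):
        for field, expected_value in expected.items():
            total[field] += 1
            if predicted.get(field) == expected_value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Placeholder rows; in practice these come from hand-labeled catalog titles.
labels = [
    {"color": "black", "size": "m"},
    {"color": "red", "size": "xl"},
]
predictions = [
    {"color": "black", "size": "m"},
    {"color": "red", "size": None},
]
print(per_field_accuracy(predictions, labels))
```

Exact-match accuracy is the simplest choice; fields with many missing values may be better served by separate precision and recall figures.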

Common Pitfalls

Starting with regex alone without defining a target schema.
Treating all product categories as if they used the same vocabulary.
Ignoring normalization and forcing every rule to handle messy token variants.
Expecting one parser pass to perfectly infer missing or ambiguous attributes.
Skipping labeled evaluation and relying only on anecdotal examples.

Summary

Product-title parsing starts with a clear schema, not with regex.
Normalization reduces the number of messy text variants the parser must handle.
Rule-based extraction is usually the right first step.
Category context matters because product vocabulary is not universal.
The best production pipelines often combine rules, dictionaries, and ML rather than choosing only one approach.