How to add new embeddings for unknown words in TensorFlow training pre-set for testing
Introduction
Unknown words are a normal part of NLP pipelines, but "handle unknown tokens" and "grow the embedding table with new words" are two different strategies. In TensorFlow, the safe pattern is to decide the unknown-word policy first, keep training and testing on the same vocabulary contract, and only extend embeddings in a deliberate versioned update.
Start With a Clear Unknown-Token Policy
Most models begin with a reserved unknown token such as "<unk>". Any word outside the known vocabulary maps to that one id.
That is not a hack. It is the baseline design that keeps training, validation, and serving stable.
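A minimal sketch of this policy, assuming a Keras `StringLookup` layer serves as the vocabulary lookup. The three-word vocabulary and the variable names are made up for illustration.

```python
import tensorflow as tf

# Hypothetical toy vocabulary; a real one would be built from the training corpus.
vocab = ["the", "model", "learns"]

# With mask_token=None and a single OOV index, id 0 is reserved for "<unk>"
# and the known words receive ids 1..len(vocab).
lookup = tf.keras.layers.StringLookup(
    vocabulary=vocab,
    mask_token=None,
    num_oov_indices=1,
    oov_token="<unk>",
)

tokens = tf.constant(["the", "transformer", "learns"])
ids = lookup(tokens).numpy().tolist()
print(ids)  # "transformer" is not in the vocabulary, so it shares id 0
```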
In this setup, unseen words map to the unknown token id. That gives deterministic behavior and makes experiments comparable.
Extend Vocabulary and Embeddings Together
If you truly need new words to receive distinct embeddings, you must update both the vocabulary mapping and the embedding matrix in the same step. Updating one without the other breaks the token-id contract immediately.
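One way to keep the two in sync is to produce the new vocabulary and the new matrix from a single function call. The sketch below uses plain NumPy and assumes that id 0 is "<unk>", that new words are appended at the end so existing ids never move, and that new rows start as small random vectors. The helper name `extend_vocab_and_embeddings` is hypothetical.

```python
import numpy as np

# Hypothetical starting state: id 0 is "<unk>", followed by the known words.
vocab = ["<unk>", "the", "model", "learns"]
embedding_dim = 4
rng = np.random.default_rng(0)
old_matrix = rng.normal(size=(len(vocab), embedding_dim)).astype("float32")


def extend_vocab_and_embeddings(vocab, matrix, new_words, rng):
    """Return a new (vocab, matrix) pair with rows appended for unseen words."""
    known = set(vocab)
    added = []
    for word in new_words:
        if word not in known:  # skip words that already have an id
            known.add(word)
            added.append(word)
    # New rows are small random vectors (an assumption); pretrained vectors
    # or the mean of related rows could be substituted here.
    new_rows = rng.normal(scale=0.02, size=(len(added), matrix.shape[1]))
    new_rows = new_rows.astype(matrix.dtype)
    return vocab + added, np.vstack([matrix, new_rows])


new_vocab, new_matrix = extend_vocab_and_embeddings(
    vocab, old_matrix, ["retriever", "reranker", "model"], rng
)
token_to_id = {word: i for i, word in enumerate(new_vocab)}
print(token_to_id["retriever"], token_to_id["reranker"], new_matrix.shape)
```

Appending rather than inserting keeps every existing id pointing at its original row, so previously trained rows stay valid.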
At this point, the ids assigned to new words such as "retriever" and "reranker" only make sense if every tokenizer or lookup layer in the pipeline has been updated to the same vocabulary.
Load the Expanded Matrix Into the Model
Once the vocabulary grows, rebuild or reconfigure the embedding layer with the new input size and assign the expanded matrix.
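A minimal sketch of that step, assuming a Keras `Embedding` layer and an expansion from 4 to 6 rows. The sizes are illustrative, and `old_layer` stands in for a layer that would normally be restored from a checkpoint.

```python
import numpy as np
import tensorflow as tf

embedding_dim = 4
old_vocab_size, new_vocab_size = 4, 6

# Stand-in for the trained layer; in practice its weights come from a checkpoint.
old_layer = tf.keras.layers.Embedding(old_vocab_size, embedding_dim)
old_layer.build((None,))
old_matrix = old_layer.embeddings.numpy()

# Expanded matrix: the original rows first, then rows for the new ids.
rng = np.random.default_rng(0)
new_rows = rng.normal(
    scale=0.02, size=(new_vocab_size - old_vocab_size, embedding_dim)
).astype("float32")
expanded = np.concatenate([old_matrix, new_rows], axis=0)

# Rebuild the layer with the larger input size and assign the expanded matrix.
new_layer = tf.keras.layers.Embedding(new_vocab_size, embedding_dim)
new_layer.build((None,))
new_layer.embeddings.assign(expanded)

# Id 5 only exists in the expanded layer; the old layer had no row for it.
vectors = np.asarray(new_layer(tf.constant([1, 5])))
print(vectors.shape)
```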
This should usually be part of a controlled retraining or fine-tuning step, not a hidden runtime side effect. New embedding rows start from random values unless you initialize them from pretrained vectors or related tokens, so they need training data and optimization to become useful.
Keep Testing Frozen Unless You Are Explicitly Testing Vocabulary Growth
The phrase "pre-set for testing" is the critical operational point. Testing should normally use a frozen vocabulary and frozen embedding matrix. Otherwise your evaluation set stops being comparable across runs.
A practical rule is:
- freeze the lookup vocabulary for the test run
- freeze the embedding matrix version that matches it
- map unseen test words to "<unk>" unless the experiment is specifically about vocabulary extension
If you let the tokenizer grow during evaluation, accuracy changes become impossible to interpret. You are no longer measuring only the model. You are also measuring a moving vocabulary boundary.
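The frozen-vocabulary rule can be enforced in code rather than by convention. The sketch below is framework-agnostic Python with hypothetical helper names (`freeze_vocab`, `encode`). It wraps the mapping in a read-only view so evaluation code cannot add entries, even by accident.

```python
from types import MappingProxyType

UNK = "<unk>"


def freeze_vocab(words):
    """Build a read-only token-to-id mapping with "<unk>" at id 0."""
    mapping = {UNK: 0}
    for word in words:
        mapping.setdefault(word, len(mapping))
    return MappingProxyType(mapping)


def encode(tokens, vocab):
    """Map tokens to ids without ever adding entries to the vocabulary."""
    unk_id = vocab[UNK]
    return [vocab.get(token, unk_id) for token in tokens]


test_vocab = freeze_vocab(["the", "model", "learns"])
size_before = len(test_vocab)
ids = encode(["the", "retriever", "learns"], test_vocab)

# The vocabulary boundary did not move during evaluation.
assert len(test_vocab) == size_before
print(ids)
```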
Consider Subword Models Before Growing the Table
Adding a brand-new embedding row for every unseen word is often the wrong long-term strategy. Many unknown words are rare, misspelled, or morphologically related to known words.
Before extending the table aggressively, consider whether one of these is better:
- a stable unknown token
- subword tokenization
- byte-level or WordPiece-style modeling
- scheduled retraining with curated vocabulary updates
The more often you expand the embedding table, the harder deployment and experiment tracking become. Vocabulary growth should usually be intentional, measurable, and versioned.
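To illustrate why subword tokenization reduces the need for new rows, here is a toy greedy longest-match splitter in the WordPiece style. It is a simplified sketch, not the algorithm of any particular library, and the subword vocabulary is hand-written for the example. Real subword vocabularies are learned from a corpus.

```python
def wordpiece_tokenize(word, subword_vocab, unk="<unk>"):
    """Greedy longest-match-first split; continuation pieces start with '##'."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in subword_vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No known piece covers this position, so the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical hand-written subword vocabulary.
subword_vocab = {"re", "retriev", "##rank", "##er", "##ing"}

print(wordpiece_tokenize("reranker", subword_vocab))
print(wordpiece_tokenize("retriever", subword_vocab))
print(wordpiece_tokenize("xyz", subword_vocab))
```

Here "reranker" and "retriever" are both covered by existing pieces, so neither needs a new embedding row. Only a word with no matching pieces falls back to the unknown token.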
Common Pitfalls
The most common mistake is adding new embedding rows without updating the lookup layer or tokenizer that produces token ids.
Another common issue is letting training and testing use different vocabularies while still comparing metrics as if nothing changed. Developers also often keep growing the vocabulary for very rare words that add complexity without real model benefit. Finally, saving model weights without saving the matching vocabulary artifact is a deployment failure waiting to happen.
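One way to guard against the last pitfall is to store the vocabulary next to a small manifest and check both at load time. The sketch below uses only the Python standard library. The file names, manifest fields, and function names are assumptions for illustration, not a TensorFlow convention.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def vocab_fingerprint(vocab):
    """Hash the ordered vocabulary so any change in words or order is detectable."""
    payload = json.dumps(list(vocab), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def save_vocab_artifact(directory, vocab, embedding_rows, version):
    """Write the vocabulary plus a manifest that ties it to the embedding matrix."""
    if embedding_rows != len(vocab):
        raise ValueError("embedding matrix rows must equal vocabulary size")
    directory = Path(directory)
    manifest = {
        "version": version,
        "vocab_size": len(vocab),
        "vocab_sha256": vocab_fingerprint(vocab),
    }
    (directory / "vocab.json").write_text(
        json.dumps(list(vocab), ensure_ascii=False), encoding="utf-8"
    )
    (directory / "manifest.json").write_text(json.dumps(manifest), encoding="utf-8")


def load_vocab_artifact(directory, embedding_rows):
    """Load the vocabulary and refuse to continue if it does not match the model."""
    directory = Path(directory)
    vocab = json.loads((directory / "vocab.json").read_text(encoding="utf-8"))
    manifest = json.loads((directory / "manifest.json").read_text(encoding="utf-8"))
    if manifest["vocab_sha256"] != vocab_fingerprint(vocab):
        raise ValueError("vocabulary file does not match its manifest")
    if embedding_rows != manifest["vocab_size"]:
        raise ValueError("embedding matrix and vocabulary come from different versions")
    return vocab, manifest["version"]


with tempfile.TemporaryDirectory() as tmp:
    vocab = ["<unk>", "the", "model", "learns", "retriever", "reranker"]
    save_vocab_artifact(tmp, vocab, embedding_rows=6, version="v2")
    loaded, version = load_vocab_artifact(tmp, embedding_rows=6)
    print(version, len(loaded))
```

In a real pipeline, `embedding_rows` would be read from the saved embedding matrix, so a mismatch surfaces at load time instead of as silently wrong lookups.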
Summary
- Decide whether unknown words should map to one shared token or trigger vocabulary growth.
- If you extend embeddings, update the vocabulary mapping and embedding matrix together.
- Rebuild or resize the embedding layer so the new ids have real rows to point to.
- Keep testing on a frozen vocabulary unless vocabulary growth is the thing you are evaluating.
- Treat vocabulary and embedding changes as versioned model artifacts, not ad hoc runtime edits.

