How to load a saved tokenizer from a pretrained model

Introduction

When you save or reuse a pretrained NLP model, the tokenizer matters just as much as the model weights. The correct way to reload it is usually to call from_pretrained on the same model name or on the directory created by save_pretrained, so the vocabulary, special tokens, and configuration stay aligned.

Why the tokenizer must match the model

A tokenizer is not just a convenience wrapper around string splitting. It defines how raw text becomes token IDs, how unknown words are handled, which padding token is used, and where special markers such as [CLS] or [SEP] appear.

If you load the wrong tokenizer for a model, the network still receives integers, but those integers no longer mean what the model expects. That can quietly destroy accuracy without causing an obvious runtime error.
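The effect can be illustrated with a toy sketch. The two vocabularies and the encode and decode helpers below are hypothetical stand-ins, not real tokenizer files or library calls; they only show how the same text can yield IDs that a model would read as different words.

```python
# Toy illustration only: two hypothetical vocabularies, not real tokenizer files.
vocab_a = {"[UNK]": 0, "great": 1, "product": 2, "bad": 3}  # used in training
vocab_b = {"[UNK]": 0, "bad": 1, "great": 2, "product": 3}  # loaded by mistake


def encode(text, vocab):
    """Map whitespace-separated words to IDs, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


def decode(ids, vocab):
    """Map IDs back to tokens using the given vocabulary."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token[i] for i in ids]


# Text is encoded with the wrong vocabulary...
ids_from_b = encode("great product", vocab_b)
# ...but a model trained with vocab_a interprets the IDs through vocab_a.
seen_by_model = decode(ids_from_b, vocab_a)

print(ids_from_b)     # [2, 3]
print(seen_by_model)  # ['product', 'bad']
```

No exception is raised at any point, which is why this kind of mismatch tends to show up as degraded predictions rather than as a crash.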

Loading directly from a pretrained model name

If you are using a model published on the Hugging Face Hub, the simplest pattern is:

python

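from transformers import AutoTokenizer

# Download (or read from the local cache) the tokenizer published with this model.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoded = tokenizer("Tokenizers turn text into model input.")

# Show the first ten token IDs.
print(encoded["input_ids"][:10])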

This downloads the tokenizer files if needed and restores them with the correct configuration for that model family.

Saving and reloading a tokenizer locally

If you already saved the tokenizer to disk, reload it from that directory:

python

from transformers import AutoTokenizer

# Save the tokenizer files to a local directory.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenizer.save_pretrained("./artifacts/tokenizer")

# Reload from that directory instead of the Hub.
loaded_tokenizer = AutoTokenizer.from_pretrained("./artifacts/tokenizer")
encoded = loaded_tokenizer("Local reload works too.")

print(encoded["input_ids"][:10])

This is the usual workflow when packaging a model for later inference, shipping it to another machine, or storing it alongside training artifacts.

Reload the tokenizer from the same model directory

If you saved both model and tokenizer into the same folder, load them from that one location:

python

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# One directory holds both the model weights and the tokenizer files.
model_dir = "./artifacts/sentiment_model"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

# Tokenize a small batch into padded PyTorch tensors.
batch = tokenizer(
    ["great product", "not what I expected"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

print(batch["input_ids"].shape)

This is the safest deployment pattern because it keeps the model and tokenizer versioned together. It also reduces the chance of accidentally pairing one checkpoint with a vocabulary from another training run.
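One cheap sanity check after loading both pieces is to compare the tokenizer's size with the number of rows in the model's input embedding. The sketch below assumes the real objects expose len(tokenizer) and model.get_input_embeddings().num_embeddings, as Transformers tokenizers and PyTorch models do; the helper name vocab_fits_embeddings and the stub classes are hypothetical and exist only so the example runs without loading real artifacts.

```python
def vocab_fits_embeddings(tokenizer, model) -> bool:
    """Return True if every ID the tokenizer can emit has an embedding row."""
    embedding_rows = model.get_input_embeddings().num_embeddings
    return len(tokenizer) <= embedding_rows


# Hypothetical stand-ins that mimic only the two attributes used above.
class StubEmbedding:
    def __init__(self, num_embeddings):
        self.num_embeddings = num_embeddings


class StubModel:
    def __init__(self, rows):
        self._embedding = StubEmbedding(rows)

    def get_input_embeddings(self):
        return self._embedding


class StubTokenizer:
    def __init__(self, size):
        self._size = size

    def __len__(self):
        return self._size


# Matching sizes pass the check.
matching = vocab_fits_embeddings(StubTokenizer(30522), StubModel(30522))
# A tokenizer with two extra tokens does not fit the same embedding.
too_large = vocab_fits_embeddings(StubTokenizer(30524), StubModel(30522))

print(matching, too_large)  # True False
```

This only catches size mismatches. Two tokenizers with the same vocabulary size but different token-to-ID mappings would pass, so it complements keeping the artifacts together rather than replacing it.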

What gets saved with the tokenizer

Depending on the tokenizer type, the saved directory may include files such as:

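tokenizer_config.json (tokenizer class and settings)
special_tokens_map.json (which tokens act as padding, unknown, separator, and so on)
tokenizer.json (the full serialized tokenizer, written by fast tokenizers)
vocab.txt (WordPiece vocabulary, as used by BERT-style models)
vocab.json and merges.txt (BPE vocabulary and merge rules, as used by GPT-2-style models)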

You normally do not load these files one by one. AutoTokenizer.from_pretrained reads the directory and picks the right tokenizer class automatically.
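To see what a save actually produces, here is a minimal sketch that runs without downloading anything. It assumes the tokenizers package is installed alongside transformers, and it builds a tiny word-level tokenizer from a made-up vocabulary (toy_vocab is hypothetical), so the resulting files differ from those of a real pretrained tokenizer.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Hypothetical toy vocabulary; a real tokenizer would have thousands of entries.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "local": 2, "reload": 3, "works": 4}
backend = Tokenizer(WordLevel(vocab=toy_vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

original = PreTrainedTokenizerFast(
    tokenizer_object=backend, unk_token="[UNK]", pad_token="[PAD]"
)

with tempfile.TemporaryDirectory() as save_dir:
    original.save_pretrained(save_dir)
    saved_files = sorted(os.listdir(save_dir))

    # AutoTokenizer reads tokenizer_config.json to choose the tokenizer class.
    reloaded = AutoTokenizer.from_pretrained(save_dir)
    ids_before = original("local reload works")["input_ids"]
    ids_after = reloaded("local reload works")["input_ids"]

print(saved_files)
print(ids_before == ids_after)
```

The exact file list depends on the tokenizer type and library version, so the sketch prints it rather than assuming it. The important part is that the reloaded tokenizer produces the same IDs as the original.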

When you added custom tokens

If you extended the vocabulary during training, make sure you save the tokenizer after those additions. Otherwise the reloaded tokenizer will not know about the new tokens and will split or map them incorrectly.

That is one reason it is a bad idea to recreate tokenizers manually from memory. Save the actual tokenizer object used during training and reload that exact artifact later.
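The difference between saving before and after adding tokens can be sketched as follows. The toy vocabulary, the directory names, and the token <product_id> are hypothetical; the tokenizer is built locally with the tokenizers package so that the example does not depend on a download.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Hypothetical toy vocabulary standing in for a pretrained one.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "great": 2, "product": 3}
backend = Tokenizer(WordLevel(vocab=toy_vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend, unk_token="[UNK]", pad_token="[PAD]"
)

with tempfile.TemporaryDirectory() as tmp:
    stale_dir = os.path.join(tmp, "before_additions")
    fresh_dir = os.path.join(tmp, "after_additions")

    tokenizer.save_pretrained(stale_dir)                 # saved too early
    num_added = tokenizer.add_tokens(["<product_id>"])   # vocabulary extended
    tokenizer.save_pretrained(fresh_dir)                 # saved after the change

    stale = AutoTokenizer.from_pretrained(stale_dir)
    fresh = AutoTokenizer.from_pretrained(fresh_dir)

    new_id = tokenizer.convert_tokens_to_ids("<product_id>")
    stale_id = stale.convert_tokens_to_ids("<product_id>")
    fresh_id = fresh.convert_tokens_to_ids("<product_id>")

print(num_added, new_id, fresh_id, stale_id, stale.unk_token_id)
```

The tokenizer reloaded from the later directory keeps the ID assigned during training, while the one saved too early falls back to the unknown-token ID. When a real model is involved, its embedding matrix also has to cover the new IDs, which is typically handled with model.resize_token_embeddings(len(tokenizer)) before training continues.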

Common Pitfalls

Loading a tokenizer from a different model family than the saved model weights.
Saving only the model and forgetting to save the tokenizer artifacts.
Adding custom tokens during training but reloading an older tokenizer directory later.
Manually reconstructing tokenizer files instead of using save_pretrained and from_pretrained.

Summary

Load tokenizers with from_pretrained, using either the model name or a saved local directory.
Save tokenizers with save_pretrained so vocabulary and special token settings are preserved.
Keep the tokenizer and model in the same artifact directory whenever possible.
A mismatched tokenizer can silently break model quality even when inference still runs.