How to add a text preprocessing and tokenization step to a TensorFlow model
Introduction
In TensorFlow, the standard way to put tokenization directly into a model pipeline is to use tf.keras.layers.TextVectorization. This layer can standardize raw strings, split them into tokens, build or accept a vocabulary, and output integer token IDs or other encodings. The result is a model that can accept raw text instead of requiring preprocessing outside the graph.
The Core Tool: TextVectorization
According to the TensorFlow docs, TextVectorization can:
- standardize text
- tokenize text
- learn a vocabulary with adapt()
- emit integer sequences or dense representations
That makes it the right first choice for many classification and sequence models.
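The capabilities listed above can be seen on a toy input. This is a minimal sketch, assuming a two-sentence corpus and illustrative settings (max_tokens=20, output_sequence_length=4) that are not from the original article. With the layer's default settings, text is lowercased, punctuation is stripped, and tokens are split on whitespace.

```python
import tensorflow as tf

# Hypothetical toy corpus, used only for illustration.
texts = ["The cat sat.", "The dog barked!"]

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20,             # illustrative vocabulary cap
    output_mode="int",         # emit integer token IDs
    output_sequence_length=4,  # pad or truncate every example to 4 tokens
)

# Learn the vocabulary from the corpus.
vectorizer.adapt(tf.data.Dataset.from_tensor_slices(texts).batch(2))

# Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens;
# learned tokens follow, most frequent first.
vocab = vectorizer.get_vocabulary()

# Raw string in, integer IDs out (shape: batch x output_sequence_length).
ids = vectorizer(tf.constant(["The cat barked"]))
print(vocab)
print(ids.numpy())
```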
A Simple End-to-End Example
This model accepts raw strings, tokenizes them inside the model, converts tokens to indices, and then trains as usual.
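A minimal sketch of such a model follows. The four-sentence dataset, the labels, and the layer sizes are assumptions chosen to keep the example small; they are not taken from the original article.

```python
import tensorflow as tf

# Hypothetical toy data: one raw string per example, shape (N, 1).
train_texts = tf.constant(
    [["great movie"], ["terrible plot"], ["loved it"], ["awful acting"]]
)
train_labels = tf.constant([[1.0], [0.0], [1.0], [0.0]])

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,           # illustrative vocabulary cap
    output_mode="int",
    output_sequence_length=8,  # illustrative fixed length
)
# Learn the vocabulary from the training text only.
vectorizer.adapt(tf.data.Dataset.from_tensor_slices(train_texts).batch(2))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype="string"),  # raw strings go in
    vectorizer,                                  # strings -> token IDs
    tf.keras.layers.Embedding(
        input_dim=vectorizer.vocabulary_size(), output_dim=16
    ),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

train_ds = tf.data.Dataset.from_tensor_slices((train_texts, train_labels)).batch(2)
model.fit(train_ds, epochs=2, verbose=0)

# The trained model can be called on raw text directly.
probs = model(tf.constant([["what a great movie"]]))
print(probs.numpy())
```

Because the vectorizer is a layer of the model, callers never see token IDs; they pass strings and get predictions back.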
The Role of adapt()
TextVectorization needs a vocabulary. You can provide one directly, or you can let the layer learn it by calling adapt() on training text.
The training-only detail matters. Adapt on the training set, not on validation or test data; otherwise the vocabulary leaks information across splits.
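Both ways of supplying a vocabulary can be sketched as follows. The example texts and the hand-written vocabulary list are hypothetical. A token that only appears in the held-out text is absent from the learned vocabulary, so it maps to the out-of-vocabulary index.

```python
import tensorflow as tf

train_texts = ["red apple", "green apple"]  # hypothetical training split
test_texts = ["purple grape"]               # hypothetical held-out split

# Option 1: learn the vocabulary from the training split only.
learned_vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
learned_vectorizer.adapt(tf.data.Dataset.from_tensor_slices(train_texts).batch(2))
learned_vocab = learned_vectorizer.get_vocabulary()

# Tokens never seen during adapt() map to the OOV index (1).
test_ids = learned_vectorizer(tf.constant(test_texts))

# Option 2: provide the vocabulary directly; no adapt() call is needed.
# The padding and OOV entries are added in front of the supplied tokens.
fixed_vectorizer = tf.keras.layers.TextVectorization(
    output_mode="int",
    vocabulary=["apple", "red", "green"],
)
fixed_ids = fixed_vectorizer(tf.constant(["red apple"]))

print(learned_vocab)
print(test_ids.numpy())
print(fixed_ids.numpy())
```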
Where To Put the Preprocessing
You have two common options:
- keep TextVectorization inside the model
- run it in a tf.data pipeline before the model
If you want a model that takes raw strings directly, put it inside the model graph as shown above.
If you want preprocessing to happen outside the compiled model, use it in the dataset pipeline:
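One way to do this is to apply the adapted layer inside Dataset.map, so the model itself only ever sees integer IDs. The sketch below assumes toy data and illustrative layer sizes.

```python
import tensorflow as tf

# Hypothetical toy data.
texts = ["good service", "bad service", "good food", "bad food"]
labels = [[1.0], [0.0], [1.0], [0.0]]

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=100,            # illustrative vocabulary cap
    output_mode="int",
    output_sequence_length=4,  # illustrative fixed length
)
vectorizer.adapt(tf.data.Dataset.from_tensor_slices(texts).batch(2))

# Tokenize in the input pipeline, outside the model.
train_ds = (
    tf.data.Dataset.from_tensor_slices((texts, labels))
    .batch(2)
    .map(lambda x, y: (vectorizer(x), y), num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

# The model now takes integer token IDs, not strings.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,), dtype="int64"),
    tf.keras.layers.Embedding(input_dim=vectorizer.vocabulary_size(), output_dim=8),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(train_ds, epochs=1, verbose=0)

first_batch, _ = next(iter(train_ds))
print(first_batch.numpy())
```

With this layout, anything that serves the model must apply the same vectorizer first, since the exported model does not include it.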
Both patterns are valid. The best choice depends on deployment and portability needs.
If you plan to export a model that should accept raw user text directly at serving time, putting the vectorizer inside the model is often the cleaner deployment story.
Custom Standardization and Splitting
You can customize normalization before tokenization. For example:
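The sketch below passes a callable as the standardize argument. The function name custom_standardize, the HTML-stripping regular expression, and the sample strings are illustrative assumptions; the callable lowercases the text, replaces tag-like fragments with a space, and then removes punctuation.

```python
import re
import string

import tensorflow as tf


def custom_standardize(text):
    """Hypothetical normalizer: lowercase, drop HTML-like tags, strip punctuation."""
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"<[^>]+>", " ")
    return tf.strings.regex_replace(
        text, "[%s]" % re.escape(string.punctuation), ""
    )


vectorizer = tf.keras.layers.TextVectorization(
    standardize=custom_standardize,  # custom normalization step
    split="whitespace",              # default splitting, stated explicitly
    output_mode="int",
)

samples = ["<b>Hello</b>, World!", "hello<br/>again"]
vectorizer.adapt(tf.data.Dataset.from_tensor_slices(samples).batch(2))

vocab = vectorizer.get_vocabulary()
print(vocab)
```

The callable must use TensorFlow string ops such as tf.strings.regex_replace rather than Python string methods, because it runs on tensors.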
This is useful when punctuation, HTML fragments, or casing rules need special handling.
When TextVectorization Is Not Enough
For some NLP systems, simple tokenization is not the whole story. If you need:
- WordPiece or SentencePiece tokenization
- transformer-specific token IDs
- multilingual or pretrained tokenizers
then you may need TensorFlow Text, KerasNLP, or a model-specific tokenizer outside plain TextVectorization.
Still, for many custom TensorFlow models, TextVectorization is the most direct answer.
Common Pitfalls
The biggest mistake is forgetting to call adapt() when you did not provide a vocabulary manually.
Another mistake is adapting on all available text, including validation and test data.
A third issue is mismatching the vectorizer output with the rest of the model, such as forgetting that an embedding layer expects integer token IDs, not tf-idf vectors.
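The output-mode mismatch is easy to see by comparing shapes and dtypes. In this sketch, with a hypothetical two-sentence corpus, "int" mode yields one integer ID per token position, which is what an embedding layer expects, while "tf_idf" mode yields one float weight per vocabulary entry, which suits a dense layer instead.

```python
import tensorflow as tf

# Hypothetical toy corpus.
texts = ["the cat sat", "the dog sat down"]
corpus_ds = tf.data.Dataset.from_tensor_slices(texts).batch(2)

int_vectorizer = tf.keras.layers.TextVectorization(
    output_mode="int", output_sequence_length=5
)
int_vectorizer.adapt(corpus_ds)

tfidf_vectorizer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
tfidf_vectorizer.adapt(corpus_ds)

# Integer IDs, shape (batch, sequence_length): feed these to Embedding.
int_out = int_vectorizer(tf.constant(texts))

# Float weights, shape (batch, vocabulary_size): feed these to Dense.
tfidf_out = tfidf_vectorizer(tf.constant(texts))

print(int_out.shape, int_out.dtype)
print(tfidf_out.shape, tfidf_out.dtype)
```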
Summary
- Use tf.keras.layers.TextVectorization to add tokenization and basic text preprocessing to a TensorFlow model.
- Call adapt() on training text unless you provide a vocabulary explicitly.
- You can place the layer inside the model or in the tf.data pipeline.
- Custom standardization logic is easy to add when needed.
- For transformer-specific tokenization, use a more specialized tokenizer stack.

