How to add a text preprocessing tokenization step to a TensorFlow model

Introduction

In TensorFlow, the standard way to put tokenization directly into a model pipeline is to use tf.keras.layers.TextVectorization. This layer can standardize raw strings, split them into tokens, build or accept a vocabulary, and output integer token IDs or other encodings. The result is a model that can accept raw text instead of requiring preprocessing outside the graph.

The Core Tool: `TextVectorization`

According to the TensorFlow docs, TextVectorization can:

standardize text
tokenize text
learn a vocabulary with adapt()
emit integer sequences or dense representations (multi-hot, count, or tf-idf vectors)

That makes it the right first choice for many classification and sequence models.
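As a rough illustration of those capabilities, the sketch below adapts two layers on a tiny made-up corpus and compares the "int" and "multi_hot" output modes. The corpus and variable names are assumptions for demonstration only.

```python
import tensorflow as tf

# Tiny illustrative corpus (hypothetical data).
corpus = tf.constant(["the cat sat", "the dog sat down"])

# "int" mode: one integer ID per token, padded or truncated to a fixed length.
int_vectorizer = tf.keras.layers.TextVectorization(
    output_mode="int", output_sequence_length=5
)
int_vectorizer.adapt(corpus)

# "multi_hot" mode: one fixed-size 0/1 vector per input string.
multi_hot_vectorizer = tf.keras.layers.TextVectorization(output_mode="multi_hot")
multi_hot_vectorizer.adapt(corpus)

sample = tf.constant(["the cat sat"])
vocab = int_vectorizer.get_vocabulary()
int_ids = int_vectorizer(sample)
multi_hot = multi_hot_vectorizer(sample)

print(vocab)            # starts with the padding token "" and the OOV token "[UNK]"
print(int_ids.shape)    # (1, 5): three token IDs followed by two padding zeros
print(multi_hot.shape)  # (1, vocabulary size of the multi-hot layer)
```

In "int" mode, index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, which is why the learned words start at index 2.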

A Simple End-to-End Example

python

import tensorflow as tf

# Toy training data: raw strings and binary labels.
texts = tf.constant([
    "TensorFlow is useful",
    "Tokenization inside the model is convenient",
    "Keras preprocessing layers are composable",
])

labels = tf.constant([1, 0, 1])

# Map each string to a fixed-length sequence of integer token IDs.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_mode="int",
    output_sequence_length=6,
)

# Learn the vocabulary from the training text.
vectorizer.adapt(texts)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),  # one raw string per example
    vectorizer,
    tf.keras.layers.Embedding(vectorizer.vocabulary_size(), 8),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(texts, labels, epochs=1, verbose=0)

This model accepts raw strings, tokenizes them inside the model, converts tokens to indices, and then trains as usual.

The Role of `adapt()`

TextVectorization needs a vocabulary. You can provide one directly, or you can let the layer learn it by calling adapt() on training text.

The "training text" part matters. Adapt only on the training set, not on validation or test data; otherwise the vocabulary leaks information across splits.
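A minimal sketch of both options, assuming a made-up two-sentence training split and a validation sentence containing a word the training split never saw. The data and the names `adapted` and `fixed` are illustrative only.

```python
import tensorflow as tf

# Hypothetical splits: "great" never appears in the training text.
train_texts = tf.constant(["good movie", "bad movie"])
val_texts = tf.constant(["great movie"])

# Option 1: learn the vocabulary from the training split only.
adapted = tf.keras.layers.TextVectorization(output_mode="int")
adapted.adapt(train_texts)

# Option 2: supply a fixed vocabulary, so no adapt() call is needed.
# The padding and OOV tokens are added by the layer itself.
fixed = tf.keras.layers.TextVectorization(
    output_mode="int", vocabulary=["movie", "good", "bad"]
)

adapted_ids = adapted(val_texts).numpy()
fixed_ids = fixed(val_texts).numpy()

print(adapted.get_vocabulary())
print(adapted_ids)  # the unseen word maps to the OOV index 1
print(fixed_ids)
```

Because the validation word "great" was not in the training vocabulary, it falls back to the out-of-vocabulary index, which is the behavior the model will also see on genuinely new text at serving time.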

Where To Put the Preprocessing

You have two common options:

keep TextVectorization inside the model
run it in a tf.data pipeline before the model

If you want a model that takes raw strings directly, put it inside the model graph as shown above.

If you want preprocessing to happen outside the compiled model, use it in the dataset pipeline:

python

dataset = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(2)
dataset = dataset.map(lambda x, y: (vectorizer(x), y))

Both patterns are valid. The best choice depends on deployment and portability needs.

If you plan to export a model that should accept raw user text directly at serving time, putting the vectorizer inside the model is often the cleaner deployment story.
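The two placements can also be combined. The sketch below trains a core model on IDs produced in a tf.data pipeline, then wraps that model together with the vectorizer so the wrapped version accepts raw strings. The toy data and the names `core_model` and `serving_model` are assumptions for illustration, not a prescribed pattern.

```python
import tensorflow as tf

# Hypothetical toy data.
texts = tf.constant(["good movie", "bad movie", "good film", "bad film"])
labels = tf.constant([1, 0, 1, 0])

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=100, output_mode="int", output_sequence_length=4
)
vectorizer.adapt(texts)

# Core model consumes integer token IDs; vectorization runs in tf.data.
core_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,), dtype="int64"),
    tf.keras.layers.Embedding(vectorizer.vocabulary_size(), 8),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
core_model.compile(optimizer="adam", loss="binary_crossentropy")

dataset = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(2)
dataset = dataset.map(lambda x, y: (vectorizer(x), y))
core_model.fit(dataset, epochs=1, verbose=0)

# Wrap the vectorizer and the trained core model so the result takes raw strings.
raw_inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
serving_model = tf.keras.Model(raw_inputs, core_model(vectorizer(raw_inputs)))

probs = serving_model(tf.constant([["good movie"], ["bad film"]]))
print(probs.shape)  # one probability per input string
```

Training on pre-vectorized batches keeps string processing out of the training step, while the wrapper reuses the same adapted vocabulary for inference.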

Custom Standardization and Splitting

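You can customize normalization before tokenization by passing a callable as the standardize argument. The split argument controls tokenization in the same way: it accepts "whitespace" (the default), "character", None, or a callable. For example: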

python

import tensorflow as tf


def custom_standardization(text):
    # Lowercase, then drop everything except letters, digits, and spaces.
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"[^a-z0-9 ]", "")
    return text


vectorizer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    output_mode="int",
    output_sequence_length=6,
)

This is useful when punctuation, HTML fragments, or casing rules need special handling.
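To see what such a function does to real input, the sketch below applies a variant that also strips HTML tags, adapts on two made-up strings, and inspects the learned vocabulary. The tag-stripping regex and the name `html_aware_standardization` are assumptions for illustration; production HTML cleaning usually needs more care.

```python
import tensorflow as tf


def html_aware_standardization(text):
    # Lowercase, replace anything that looks like an HTML tag with a space,
    # then drop every character except letters, digits, and spaces.
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"<[^>]+>", " ")
    return tf.strings.regex_replace(text, r"[^a-z0-9 ]", "")


html_vectorizer = tf.keras.layers.TextVectorization(
    standardize=html_aware_standardization,
    split="whitespace",  # the default, written out for clarity
    output_mode="int",
)
html_vectorizer.adapt(tf.constant(["Hello,<br>World!", "hello again"]))

cleaned = html_aware_standardization(tf.constant("Hello,<br>World!"))
html_vocab = html_vectorizer.get_vocabulary()

print(cleaned.numpy())  # the cleaned string, as bytes
print(html_vocab)       # "hello" ranks first among the learned words
```

Replacing tags with a space rather than an empty string keeps "hello" and "world" from being fused into a single token.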

When `TextVectorization` Is Not Enough

For some NLP systems, simple tokenization is not the whole story. If you need:

WordPiece or SentencePiece tokenization
transformer-specific token IDs
multilingual or pretrained tokenizers

then you may need TensorFlow Text, KerasNLP (since renamed KerasHub), or a model-specific tokenizer instead of plain TextVectorization.

Still, for many custom TensorFlow models, TextVectorization is the most direct answer.

Common Pitfalls

The biggest mistake is forgetting to call adapt() when you did not provide a vocabulary manually.

Another mistake is adapting on all available text, including validation and test data.

A third issue is a mismatch between the vectorizer output and the rest of the model. An embedding layer expects integer token IDs (output_mode="int"), not multi-hot, count, or tf-idf vectors.
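The output-mismatch pitfall is easy to check by inspecting dtypes. In this sketch, with made-up data and illustrative names, a tf-idf vectorizer produces float vectors that suit a Dense layer, while the integer output is what an Embedding layer would consume.

```python
import tensorflow as tf

# Hypothetical toy data.
texts = tf.constant(["good movie", "bad movie", "good film", "bad film"])

# tf-idf mode: one float vector per string, one column per vocabulary entry.
tfidf_vectorizer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
tfidf_vectorizer.adapt(texts)
features = tfidf_vectorizer(texts)

# Float features go straight into a Dense layer, with no Embedding in between.
dense_out = tf.keras.layers.Dense(1, activation="sigmoid")(features)

# int mode: integer IDs, the input an Embedding layer expects.
id_vectorizer = tf.keras.layers.TextVectorization(
    output_mode="int", output_sequence_length=3
)
id_vectorizer.adapt(texts)
token_ids = id_vectorizer(texts)

print(features.dtype, features.shape)
print(token_ids.dtype, token_ids.shape)
```

Printing the dtype and shape of the vectorizer output before wiring up the next layer is a cheap way to catch this class of error early.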

Summary

Use tf.keras.layers.TextVectorization to add tokenization and basic text preprocessing to a TensorFlow model.
Call adapt() on training text unless you provide a vocabulary explicitly.
You can place the layer inside the model or in the tf.data pipeline.
Custom standardization logic is easy to add when needed.
For transformer-specific tokenization, use a more specialized tokenizer stack.
