How to Add a Text Preprocessing and Tokenization Step to a TensorFlow Model

Introduction

In TensorFlow, the standard way to put tokenization directly into a model pipeline is to use tf.keras.layers.TextVectorization. This layer can standardize raw strings, split them into tokens, build or accept a vocabulary, and output integer token IDs or other encodings. The result is a model that can accept raw text instead of requiring preprocessing outside the graph.

The Core Tool: TextVectorization

According to the TensorFlow docs, TextVectorization can:

  • standardize text
  • tokenize text
  • learn a vocabulary with adapt()
  • emit integer token sequences (output_mode="int") or fixed-width vectors ("multi_hot", "count", "tf_idf")

That makes it the right first choice for many classification and sequence models.
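To make those four capabilities concrete, here is a small sketch that runs the layer on its own, outside any model. The two sample sentences and the variable names are illustrative assumptions, and it assumes Keras is running on the TensorFlow backend.

```python
import tensorflow as tf

# Two illustrative sentences; "is" appears in both, so it is the most frequent token.
sample = tf.constant(["TensorFlow is useful", "Tokenization is convenient"])

# Integer mode: each string becomes a fixed-length sequence of token IDs.
int_vectorizer = tf.keras.layers.TextVectorization(
    output_mode="int",
    output_sequence_length=4,
)
int_vectorizer.adapt(sample)

# In "int" mode, index 0 is reserved for padding and index 1 for unknown tokens.
vocab = int_vectorizer.get_vocabulary()
ids = int_vectorizer(sample)
print(vocab[:3])   # ['', '[UNK]', 'is']
print(ids.shape)   # (2, 4)

# Multi-hot mode: each string becomes one vector with a slot per vocabulary entry.
multi_hot_vectorizer = tf.keras.layers.TextVectorization(output_mode="multi_hot")
multi_hot_vectorizer.adapt(sample)
bag = multi_hot_vectorizer(sample)
print(bag.shape)   # (2, vocabulary size)
```

Inspecting get_vocabulary() and a few outputs like this is a quick sanity check before wiring the layer into a model.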

A Simple End-to-End Example

python
import tensorflow as tf

# Toy training data: three raw strings and their binary labels.
texts = tf.constant([
    "TensorFlow is useful",
    "Tokenization inside the model is convenient",
    "Keras preprocessing layers are composable",
])

labels = tf.constant([1, 0, 1])

# Keep at most 1000 vocabulary entries; pad or truncate each example to 6 token IDs.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_mode="int",
    output_sequence_length=6,
)

# Learn the vocabulary from the training text.
vectorizer.adapt(texts)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),  # one raw string per example
    vectorizer,  # raw strings -> integer token IDs
    tf.keras.layers.Embedding(vectorizer.vocabulary_size(), 8),  # IDs -> 8-dim vectors
    tf.keras.layers.GlobalAveragePooling1D(),  # average over the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(texts, labels, epochs=1, verbose=0)

This model accepts raw strings, tokenizes them inside the model, converts tokens to indices, and then trains as usual.

The Role of adapt()

TextVectorization needs a vocabulary. You can provide one directly, or you can let the layer learn it by calling adapt() on training text.

The "training text" part matters. Adapt on the training set only, not on validation or test data; otherwise you leak information across splits.
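Both ways of supplying a vocabulary can be sketched in a few lines. The sentences, the held-out example, and the names learned and fixed are illustrative assumptions; the point is that a word seen only outside the training split maps to the unknown-token ID rather than getting its own entry.

```python
import tensorflow as tf

train_texts = tf.constant(["the cat sat", "the dog sat"])
test_texts = tf.constant(["the bird sat"])  # hypothetical held-out example

# Option 1: learn the vocabulary from the training split only.
learned = tf.keras.layers.TextVectorization(output_mode="int")
learned.adapt(train_texts)

# "bird" never appeared in the training text, so it maps to ID 1 ([UNK]).
test_ids = learned(test_texts)
print(test_ids.numpy())

# Option 2: pass a fixed vocabulary and skip adapt() entirely.
# The layer prepends the padding token (ID 0) and [UNK] (ID 1) itself.
fixed = tf.keras.layers.TextVectorization(
    output_mode="int",
    vocabulary=["the", "cat", "sat"],
)
fixed_ids = fixed(tf.constant(["the cat ran"]))
print(fixed_ids.numpy())  # [[2 3 1]]
```

A fixed vocabulary is also handy when the token list comes from elsewhere, such as a file produced by an earlier training run.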

Where To Put the Preprocessing

You have two common options:

  • keep TextVectorization inside the model
  • run it in a tf.data pipeline before the model

If you want a model that takes raw strings directly, put it inside the model graph as shown above.

If you want preprocessing to happen outside the compiled model, use it in the dataset pipeline:

python
dataset = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(2)
dataset = dataset.map(lambda x, y: (vectorizer(x), y))
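For completeness, here is a self-contained sketch of the dataset-pipeline variant, including a model that consumes the resulting token IDs. The name id_model, the float labels, and the AUTOTUNE settings are illustrative choices rather than part of the original example.

```python
import tensorflow as tf

texts = tf.constant([
    "TensorFlow is useful",
    "Tokenization inside the model is convenient",
    "Keras preprocessing layers are composable",
])
labels = tf.constant([1.0, 0.0, 1.0])

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_mode="int",
    output_sequence_length=6,
)
vectorizer.adapt(texts)

# Tokenize inside the input pipeline; the model never sees raw strings.
dataset = (
    tf.data.Dataset.from_tensor_slices((texts, labels))
    .batch(2)
    .map(lambda x, y: (vectorizer(x), y), num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

# The model starts from integer token IDs, so the vectorizer is not one of its layers.
id_model = tf.keras.Sequential([
    tf.keras.Input(shape=(6,), dtype="int64"),
    tf.keras.layers.Embedding(vectorizer.vocabulary_size(), 8),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
id_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
id_model.fit(dataset, epochs=1, verbose=0)
```

With this layout, anything that calls id_model at inference time has to apply the same vectorizer first.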

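Both patterns are valid. With the layer inside the model, the model carries its own vocabulary and accepts raw strings; in a tf.data pipeline, tokenization runs as part of input loading and the model itself only ever sees integer IDs. The best choice depends on deployment and portability needs.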

If you plan to export a model that should accept raw user text directly at serving time, putting the vectorizer inside the model is often the cleaner deployment story.

Custom Standardization and Splitting

You can customize normalization before tokenization. For example:

python
import tensorflow as tf


def custom_standardization(text):
    # Lowercase, then drop every character that is not a-z, 0-9, or a space.
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"[^a-z0-9 ]", "")
    return text


# Pass the callable in place of the built-in standardization.
vectorizer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    output_mode="int",
    output_sequence_length=6,
)

This is useful when punctuation, HTML fragments, or casing rules need special handling.
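As a sketch of the HTML case, the variant below replaces tags with a space before removing punctuation, so that words on either side of a tag do not get glued together. The function name strip_html_and_punctuation, the tag pattern, and the sample review are assumptions made for illustration.

```python
import tensorflow as tf


def strip_html_and_punctuation(text):
    # Hypothetical helper: lowercase, replace HTML tags with a space,
    # then drop everything except a-z, 0-9, and spaces.
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"<[^>]+>", " ")
    text = tf.strings.regex_replace(text, r"[^a-z0-9 ]", "")
    return text


raw = tf.constant(["Great movie!<br />Would watch AGAIN."])

# Call the function directly to check what the tokenizer will receive.
cleaned = strip_html_and_punctuation(raw)
print(cleaned.numpy()[0])  # b'great movie would watch again'

html_vectorizer = tf.keras.layers.TextVectorization(
    standardize=strip_html_and_punctuation,
    output_mode="int",
)
html_vectorizer.adapt(raw)
html_vocab = html_vectorizer.get_vocabulary()
print(html_vocab)
```

Because the function uses only tf.strings operations, it can run inside the layer on batches of string tensors. Testing it on a few raw examples first, as above, tends to catch regex mistakes early.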

When TextVectorization Is Not Enough

For some NLP systems, simple tokenization is not the whole story. If you need:

  • WordPiece or SentencePiece tokenization
  • transformer-specific token IDs
  • multilingual or pretrained tokenizers

then you may need TensorFlow Text, KerasNLP, or a model-specific tokenizer instead of plain TextVectorization.

Still, for many custom TensorFlow models, TextVectorization is the most direct answer.

Common Pitfalls

The biggest mistake is forgetting to call adapt() when you did not provide a vocabulary manually.

Another mistake is adapting on all available text, including validation and test data.

A third issue is mismatching the vectorizer output with the rest of the model, such as forgetting that an embedding layer expects integer token IDs, not tf-idf vectors.
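The third pitfall is easier to see with shapes in hand. In this sketch (sentences, layer sizes, and variable names are illustrative), integer output feeds an Embedding layer, while tf-idf output is already a fixed-width float vector and feeds a Dense layer directly.

```python
import tensorflow as tf

texts = tf.constant([
    "TensorFlow is useful",
    "Tokenization inside the model is convenient",
    "Keras preprocessing layers are composable",
])

# "int" output: a sequence of token IDs per example, suitable for Embedding.
int_vec = tf.keras.layers.TextVectorization(
    output_mode="int",
    output_sequence_length=6,
)
int_vec.adapt(texts)
embedded = tf.keras.layers.Embedding(int_vec.vocabulary_size(), 8)(int_vec(texts))
print(embedded.shape)   # (3, 6, 8)

# "tf_idf" output: one float vector per example, one slot per vocabulary entry.
# These are weights, not indices, so they go to a Dense layer instead.
tfidf_vec = tf.keras.layers.TextVectorization(output_mode="tf_idf")
tfidf_vec.adapt(texts)
tfidf_features = tfidf_vec(texts)
dense_out = tf.keras.layers.Dense(4)(tfidf_features)
print(dense_out.shape)  # (3, 4)
```

If an Embedding layer is the next step, keep output_mode="int"; if the vectorizer emits multi-hot, count, or tf-idf vectors, start the rest of the model with a Dense layer.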

Summary

  • Use tf.keras.layers.TextVectorization to add tokenization and basic text preprocessing to a TensorFlow model.
  • Call adapt() on training text unless you provide a vocabulary explicitly.
  • You can place the layer inside the model or in the tf.data pipeline.
  • Custom standardization logic is easy to add when needed.
  • For transformer-specific tokenization, use a more specialized tokenizer stack.
