Compress a TensorFlow model

Introduction

Model compression reduces size, memory footprint, and inference latency, which is critical for edge and mobile deployment. TensorFlow supports multiple compression techniques with different accuracy and complexity tradeoffs. The best results come from measuring baseline performance first, then applying the minimum compression needed for your target.

Baseline and Export First

Before compression, save a clear baseline model and metrics.

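python
import tensorflow as tf

# Small CNN for 28x28 grayscale images with 10 output classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train and evaluate here (model.fit / model.evaluate) so the baseline has real metrics.

# Writes a SavedModel directory on TF 2.15 and earlier. With Keras 3 (TF 2.16+),
# model.save() needs a .keras or .h5 suffix; use model.export('baseline_model') instead.
model.save('baseline_model')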

Record file size and validation accuracy so you can compare every compression step objectively.
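Recording size can be scripted so every compression step reports the same number. The helper below is a minimal sketch using only the standard library; the function names artifact_size_bytes and size_reduction are illustrative, not part of TensorFlow. It handles both a SavedModel directory and a single .tflite file.

```python
import os


def artifact_size_bytes(path):
    """Return the on-disk size of a model file or model directory in bytes."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    # A SavedModel is a directory tree, so sum every file below it.
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def size_reduction(baseline_bytes, compressed_bytes):
    """Return the fraction of the baseline size that compression removed."""
    return 1.0 - compressed_bytes / baseline_bytes
```

For example, size_reduction(artifact_size_bytes('baseline_model'), artifact_size_bytes('model_int8_dynamic.tflite')) gives the fraction saved by the quantized file once both artifacts exist.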

Post-Training Quantization

Post-training quantization is usually the easiest first optimization.

python
import tensorflow as tf

# Load the SavedModel written by the baseline step.
converter = tf.lite.TFLiteConverter.from_saved_model('baseline_model')
# Optimize.DEFAULT without a representative dataset applies dynamic range quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model_int8_dynamic.tflite', 'wb') as f:
    f.write(tflite_model)

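Dynamic range quantization stores weights as 8-bit integers instead of 32-bit floats, so the file is often close to 4x smaller, and it needs no retraining or calibration data.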
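A rough size estimate for the baseline model above can be worked out by hand. The sketch below counts parameters for the Conv2D and Dense layers (assuming the Keras defaults of stride 1 and 'valid' padding) and compares 4 bytes per weight against 1 byte per weight. It is an upper bound on the saving, not a prediction: real files also carry graph metadata, and the converter may keep biases and small tensors in float.

```python
def conv2d_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel for a square kernel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


# Conv2D(16, 3) on a 28x28x1 input with 'valid' padding gives a 26x26x16 output.
conv = conv2d_params(3, 1, 16)
flattened = (28 - 3 + 1) ** 2 * 16
dense = dense_params(flattened, 10)

total_params = conv + dense
float32_bytes = total_params * 4  # 32-bit float weights
int8_bytes = total_params * 1     # 8-bit integer weights

print(total_params, float32_bytes, int8_bytes)
```

Nearly all of the parameters sit in the Dense layer, so that is where quantization saves the most in this model.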

Full integer quantization also quantizes activations, so the converter needs representative data to calibrate their value ranges.

python
import numpy as np
import tensorflow as tf


def representative_data_gen():
    # Placeholder: random tensors only demonstrate the expected shape and dtype.
    # Replace them with about 100 real samples from the production input distribution.
    for _ in range(100):
        yield [np.random.rand(1, 28, 28, 1).astype('float32')]

converter = tf.lite.TFLiteConverter.from_saved_model('baseline_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Restrict the model to int8 kernels; conversion fails if an op has no int8 implementation.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
int8_model = converter.convert()

Representative samples should match the production input distribution, because the converter derives activation ranges from them.
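Why the calibration data matters can be seen from the arithmetic of affine int8 quantization. The sketch below is a simplified, pure-Python illustration of the general scheme (a scale and a zero point derived from an observed min and max). It is not the converter's actual implementation, and the function names are illustrative.

```python
def quant_params(min_val, max_val, qmin=-128, qmax=127):
    """Derive scale and zero point from a calibrated float range."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    if max_val == min_val:
        return 1.0, 0
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = int(round(qmin - min_val / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to int8, clamping values outside the calibrated range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an int8 value back to an approximate float."""
    return scale * (q - zero_point)


# Calibrate on samples that only cover [0, 1], like the random placeholder data.
scale, zero_point = quant_params(0.0, 1.0)
in_range = quantize(0.5, scale, zero_point)
out_of_range = quantize(2.0, scale, zero_point)  # saturates at 127
```

An input of 2.0 quantizes to the same value as 1.0 here. If production inputs fall outside the calibrated range, they are clipped in this way and accuracy suffers.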

Pruning During Training

Magnitude pruning gradually zeroes out low-magnitude weights during training. The resulting sparse weights compress better, for example with gzip.

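python
import tensorflow_model_optimization as tfmot

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# Wrap the baseline Keras model (defined earlier) so sparsity ramps from 0% to 50%
# over the first 1000 training steps.
pruned_model = prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,
        final_sparsity=0.5,
        begin_step=0,
        end_step=1000
    )
)

pruned_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# When calling pruned_model.fit(...), pass
# callbacks=[tfmot.sparsity.keras.UpdatePruningStep()] so the schedule advances.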

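After training, remove the pruning wrappers with tfmot.sparsity.keras.strip_pruning(pruned_model) before export. The zeroed weights only shrink the artifact once the file is compressed or the runtime supports sparse kernels.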
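The schedule and the pruning rule can be illustrated without TensorFlow. The sketch below assumes a polynomial ramp with exponent 3 (believed to be the library default; check the tfmot documentation for your version) and applies magnitude pruning to a plain list of weights. Both function names are illustrative.

```python
def polynomial_sparsity(step, initial=0.0, final=0.5, begin=0, end=1000, power=3):
    """Target sparsity at a training step under a polynomial ramp (assumed exponent)."""
    if step <= begin:
        return initial
    if step >= end:
        return final
    progress = (step - begin) / (end - begin)
    return final + (initial - final) * (1.0 - progress) ** power


def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights, smallest magnitudes first."""
    k = int(len(weights) * sparsity)
    # Indices sorted by absolute value; the first k are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


target = polynomial_sparsity(1000)
pruned = prune_by_magnitude([0.9, -0.1, 0.05, -0.7], target)
```

With a final sparsity of 0.5, the two smallest-magnitude weights (-0.1 and 0.05) become zero and the two largest are kept.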

Knowledge Distillation

Distillation trains a smaller student model to mimic a larger teacher model. This can preserve accuracy better than naive downsizing.

Typical flow:

  • train strong teacher
  • train small student on teacher soft targets
  • export student with quantization

Distillation adds training complexity but can produce a better balance between size and accuracy.
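The "soft targets" step can be made concrete with one common formulation of the distillation loss: a weighted sum of the usual cross-entropy on the true label and a KL divergence between temperature-softened teacher and student distributions. The sketch below is pure Python for a single example; the temperature and alpha values are illustrative assumptions, not recommendations.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.1):
    """alpha * hard-label loss + (1 - alpha) * T^2 * soft-target loss."""
    hard = -math.log(softmax(student_logits)[label])
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft
```

When the student reproduces the teacher's logits exactly, the soft term is zero and only the hard-label term remains.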

Measure Latency and Accuracy Together

Compression decisions should be driven by deployment metrics, not only model file size.

Evaluate:

  • on-device inference latency
  • peak memory usage
  • model size on disk
  • task-specific accuracy and error profile

A smaller model is not useful if it misses required accuracy thresholds.

Deployment Format Choices

For mobile and embedded targets, TFLite is the typical choice. For server inference, a compressed SavedModel may still be acceptable, depending on the serving stack. Choose the format based on the target runtime, not on conversion convenience.

Also verify operator compatibility early to avoid late conversion failures.

Common Pitfalls

  • Applying aggressive quantization without validating on representative data.
  • Evaluating a compressed model with no baseline metrics to compare against.
  • Assuming pruning alone always speeds up inference.
  • Ignoring operator support limits for target runtime.
  • Optimizing only model size while neglecting accuracy impact.

Summary

  • Start with baseline metrics and a reproducible export pipeline.
  • Use post-training quantization as the first compression step.
  • Apply pruning and distillation when further reduction is needed.
  • Measure latency, memory, and accuracy together.
  • Validate compressed models on real target hardware before release.

Evaluation Harness Example

A practical deployment workflow keeps a small benchmark harness that runs baseline and compressed models on the same validation batch and hardware target. Record latency percentiles, memory usage, and accuracy deltas in one report so tradeoffs are visible to product and infrastructure teams.
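A minimal sketch of such a harness follows, using only the standard library. It assumes each model is wrapped in a hypothetical predict function that takes a batch and returns a list of predicted labels; how that wrapper calls the baseline model or the TFLite interpreter is left out. Memory measurement is also omitted because it depends on the target platform.

```python
import math
import time


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def benchmark(predict_fn, inputs, labels, runs=30, warmup=3):
    """Time repeated calls on one fixed batch and score the last result."""
    for _ in range(warmup):
        predict_fn(inputs)  # warm-up calls are not timed
    latencies_ms = []
    predictions = None
    for _ in range(runs):
        start = time.perf_counter()
        predictions = predict_fn(inputs)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "accuracy": accuracy(predictions, labels),
    }


def compare(baseline_fn, compressed_fn, inputs, labels):
    """Run both models on the same batch and report the accuracy delta."""
    base = benchmark(baseline_fn, inputs, labels)
    comp = benchmark(compressed_fn, inputs, labels)
    return {
        "baseline": base,
        "compressed": comp,
        "accuracy_delta": comp["accuracy"] - base["accuracy"],
    }
```

Running both models through the same benchmark function on the same batch keeps the comparison fair; latency numbers only mean something when measured on the target hardware.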

Automated comparison gates in CI can block model artifacts that exceed latency budgets or drop below minimum quality thresholds.
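A gate of this kind can be a small function that a CI job calls on the benchmark report. The sketch below assumes a hypothetical report dictionary with p95 latency and accuracy for the compressed model plus an accuracy delta against the baseline; the key names and thresholds are illustrative.

```python
def check_gates(report, max_p95_ms, min_accuracy, max_accuracy_drop):
    """Return a list of human-readable gate failures; empty means the model passes."""
    failures = []
    compressed = report["compressed"]
    if compressed["p95_ms"] > max_p95_ms:
        failures.append(
            "p95 latency %.2f ms exceeds budget %.2f ms" % (compressed["p95_ms"], max_p95_ms)
        )
    if compressed["accuracy"] < min_accuracy:
        failures.append(
            "accuracy %.4f is below minimum %.4f" % (compressed["accuracy"], min_accuracy)
        )
    # accuracy_delta is compressed minus baseline, so a drop is negative.
    drop = -report["accuracy_delta"]
    if drop > max_accuracy_drop:
        failures.append(
            "accuracy drop %.4f exceeds allowed %.4f" % (drop, max_accuracy_drop)
        )
    return failures


def exit_code(failures):
    """Map the failure list to a process exit code for the CI job."""
    return 1 if failures else 0


# Hypothetical report values, for illustration only.
sample_report = {
    "compressed": {"p95_ms": 12.0, "accuracy": 0.91},
    "accuracy_delta": -0.03,
}
sample_failures = check_gates(sample_report, max_p95_ms=10.0, min_accuracy=0.90, max_accuracy_drop=0.02)
```

In a CI script, passing exit_code(sample_failures) to sys.exit would fail the job and block the artifact whenever any gate is violated.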

Rollback Planning

Keep the previous model artifact available for instant rollback in case the compressed variant behaves unexpectedly on production traffic. A fast rollback path is as important as compression itself, because real-world data can expose edge cases missed in offline evaluation.
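One simple way to keep that path open is to track promoted artifacts in order and never delete the previous one. The class below is a hypothetical in-memory sketch, not a real serving API. A production setup would persist this state and point the serving stack at the active path.

```python
class ModelRegistry:
    """Tracks promoted model artifacts so the previous one can be restored."""

    def __init__(self):
        # Artifact paths in promotion order; the last entry is the active model.
        self._history = []

    @property
    def active(self):
        """Path of the currently served artifact, or None if nothing is promoted."""
        return self._history[-1] if self._history else None

    def promote(self, artifact_path):
        """Make a new artifact active while keeping the previous one on record."""
        self._history.append(artifact_path)

    def rollback(self):
        """Drop the active artifact and return the one that was active before it."""
        if len(self._history) < 2:
            raise RuntimeError("no previous artifact to roll back to")
        self._history.pop()
        return self._history[-1]
```

Promoting 'baseline_model' and then 'model_int8_dynamic.tflite' leaves the baseline one rollback call away.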

