How to append data to TensorFlow tfrecords file

Introduction

TFRecord files are sequential binary record streams, and the TensorFlow Python APIs are designed primarily for writing new files, not for safe in-place append workflows. In practice, the usual solutions are either to write a new shard for new data, or to rebuild the dataset into a new TFRecord file.

Understand the Practical Limitation

A TFRecord is not a database table with update and append semantics. It is a record stream. That means the safest pattern is usually:

  • write data once
  • treat the file as immutable
  • create additional shards instead of reopening the same file for ad hoc appends
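To make the "record stream" idea concrete, here is a minimal sketch of walking the on-disk framing by hand. It relies on the layout described in the TensorFlow TFRecord guide: each record is an 8-byte little-endian length, a 4-byte masked CRC32C of that length, the payload, and a 4-byte masked CRC32C of the payload. The function name `iter_record_payloads` is hypothetical, and the checksums are skipped rather than verified, so this is an illustration and not a replacement for tf.data.TFRecordDataset.

```python
import struct
from typing import BinaryIO, Iterator


def iter_record_payloads(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each record payload from a TFRecord-framed byte stream.

    Checksums are skipped, not verified, to keep the sketch short.
    """
    while True:
        header = stream.read(8)  # uint64, little-endian payload length
        if not header:
            return  # clean end of stream
        if len(header) < 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<Q", header)
        stream.read(4)  # masked CRC32C of the length (ignored here)
        payload = stream.read(length)
        footer = stream.read(4)  # masked CRC32C of the payload (ignored here)
        if len(payload) < length or len(footer) < 4:
            raise ValueError("truncated record")
        yield payload
```

There is no index or header to update in this layout: records can only be found by walking the stream from the start, which is why a half-written record at the end of a file is a real risk for ad hoc appends.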

Creating a record looks like this:

python
import tensorflow as tf


def make_example(text: str, label: int) -> bytes:
    """Serialize one (text, label) pair as a tf.train.Example."""
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[text.encode("utf-8")])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }
        )
    )
    return example.SerializeToString()

Writing a file is straightforward:

python
with tf.io.TFRecordWriter("data-0001.tfrecord") as writer:
    writer.write(make_example("alpha", 1))
    writer.write(make_example("beta", 0))

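The difficulty begins when you want to “append” later: reopening the same path with tf.io.TFRecordWriter starts a new file and overwrites the existing records instead of adding to them.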

Prefer Writing Another Shard

For most pipelines, the best append strategy is not to modify the old file at all. Write the new records into a new shard and read both shards as one dataset later.

python
with tf.io.TFRecordWriter("data-0002.tfrecord") as writer:
    writer.write(make_example("gamma", 1))
    writer.write(make_example("delta", 0))
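Choosing the next shard name by hand gets error-prone as the dataset grows. The sketch below picks the next unused index, assuming the `data-NNNN.tfrecord` naming used in this article and a local directory. The helper name `next_shard_path` is hypothetical, and the function is not safe against two writers running at the same time.

```python
import re
from pathlib import Path

# Matches the shard naming used in this article, e.g. "data-0002.tfrecord".
SHARD_PATTERN = re.compile(r"^data-(\d+)\.tfrecord$")


def next_shard_path(directory: str) -> Path:
    """Return the path for the next unused shard in `directory`."""
    indices = [
        int(m.group(1))
        for path in Path(directory).iterdir()
        if (m := SHARD_PATTERN.match(path.name))
    ]
    next_index = max(indices, default=0) + 1  # start at 0001 when empty
    return Path(directory) / f"data-{next_index:04d}.tfrecord"
```

The result can be passed to the writer as `tf.io.TFRecordWriter(str(next_shard_path(".")))`, so existing shards are never reopened.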

Then load all shards together:

python
dataset = tf.data.TFRecordDataset([
    "data-0001.tfrecord",
    "data-0002.tfrecord",
])

for record in dataset.take(2):
    print(len(record.numpy()))

This is usually the most robust design because it keeps old data immutable and makes incremental dataset growth easy to reason about.
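Listing every shard by hand stops scaling once new shards arrive regularly. One possible approach, sketched below, is to discover shards with a glob pattern. The helper name `load_shards` and the pattern in the usage note are assumptions; tf.io.gfile.glob and tf.data.TFRecordDataset are the only TensorFlow calls involved.

```python
import tensorflow as tf


def load_shards(pattern: str) -> tf.data.Dataset:
    """Build one dataset of raw serialized records from all matching shards."""
    # Sort so the shard order is deterministic across runs.
    filenames = sorted(tf.io.gfile.glob(pattern))
    if not filenames:
        raise FileNotFoundError(f"no TFRecord shards match {pattern!r}")
    return tf.data.TFRecordDataset(filenames)
```

With the files from this article, `load_shards("data-*.tfrecord")` would pick up `data-0001.tfrecord`, `data-0002.tfrecord`, and any shard added later without a code change.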

Rebuild When One Output File Is Required

If you absolutely need one final TFRecord file, read the existing file, write its contents into a new output file, then write the new records after it.

python
import tensorflow as tf

source_file = "data-0001.tfrecord"
merged_file = "data-merged.tfrecord"

# make_example is the helper defined earlier in this article.
new_records = [
    make_example("gamma", 1),
    make_example("delta", 0),
]

with tf.io.TFRecordWriter(merged_file) as writer:
    # Copy the existing records byte-for-byte, without re-parsing them.
    for raw in tf.data.TFRecordDataset([source_file]):
        writer.write(raw.numpy())

    # Then add the new records after the old ones.
    for record in new_records:
        writer.write(record)

This is not a true in-place append. It is a controlled rewrite, which is much safer.
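If the merged file must end up under a name that readers already use, a crash halfway through the rewrite could leave a truncated file behind. A common way to limit that risk, sketched here under the assumption of a local POSIX-style filesystem, is to build the file under a temporary name and rename it into place at the end. The helper `rewrite_atomically` and its callback parameter are hypothetical names, not part of TensorFlow.

```python
import os
import tempfile
from typing import Callable


def rewrite_atomically(target_path: str, write_contents: Callable[[str], None]) -> None:
    """Build the new file under a temporary name, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)  # the callback reopens the path itself
    try:
        write_contents(temp_path)
        os.replace(temp_path, target_path)  # readers see the old or the new file
    except BaseException:
        # Leave the original target untouched and clean up the partial file.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

The callback would open `tf.io.TFRecordWriter(path)` on the path it receives and run the same copy-then-append loop shown above. Object stores and network filesystems may not give the same rename guarantees.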

Keep the Schema Consistent

Conceptually, appending data only works if the old and new serialized examples follow the same feature schema. If the new records add, remove, or retype features unexpectedly, downstream parsing may start failing.

A consistent parse function helps enforce that:

python
feature_spec = {
    "text": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def parse_record(raw_record):
    return tf.io.parse_single_example(raw_record, feature_spec)

If older and newer records differ structurally, you should treat that as a format migration problem, not as a simple append.
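One way to catch a structural difference before it reaches a shard is to inspect each serialized record at write time. The sketch below decodes the bytes back into a tf.train.Example and compares the feature names and value kinds against an expected mapping. The names `EXPECTED_KINDS` and `check_schema` are hypothetical, and the mapping assumes the two features used in this article.

```python
import tensorflow as tf

# Expected feature name -> which field of tf.train.Feature must be set.
EXPECTED_KINDS = {"text": "bytes_list", "label": "int64_list"}


def check_schema(serialized: bytes, expected: dict = EXPECTED_KINDS) -> None:
    """Raise ValueError if a serialized Example does not match `expected`."""
    example = tf.train.Example.FromString(serialized)
    features = example.features.feature
    # WhichOneof reports which of bytes_list / float_list / int64_list is set.
    actual = {name: features[name].WhichOneof("kind") for name in features}
    if actual != expected:
        raise ValueError(f"schema mismatch: expected {expected}, got {actual}")
```

Calling `check_schema(record)` just before `writer.write(record)` turns a silent format drift into an immediate error. It only checks names and kinds, not the number of values per feature.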

Sharding Is Usually Better Than Mutation

Operationally, sharding has major advantages:

  • safer incremental writes
  • easier parallel reads
  • clearer failure recovery
  • less risk of corrupting one monolithic file

That is why many TensorFlow pipelines naturally produce many TFRecord files instead of one growing file.
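The "easier parallel reads" point can be sketched with the `num_parallel_reads` argument of tf.data.TFRecordDataset, which reads several files concurrently and interleaves their records. The helper name `read_shards_in_parallel` is hypothetical, and the sketch assumes a TensorFlow version that provides tf.data.AUTOTUNE.

```python
import tensorflow as tf


def read_shards_in_parallel(filenames: list) -> tf.data.Dataset:
    """Read several TFRecord shards concurrently as one dataset."""
    # With parallel reads, records from different shards are interleaved,
    # so the output order is not "all of shard 1, then all of shard 2".
    return tf.data.TFRecordDataset(
        filenames,
        num_parallel_reads=tf.data.AUTOTUNE,
    )
```

A single monolithic file offers no such option, because there is only one stream to read from.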

Common Pitfalls

  • Expecting TFRecord to support database-like in-place append behavior leads to brittle workflows.
  • Rewriting a file without preserving the exact serialized bytes can accidentally change schema or ordering expectations.
  • Appending records with a different feature layout breaks downstream parsing.
  • Keeping one giant TFRecord file instead of sharding makes maintenance and recovery harder.
  • Treating immutability as a limitation instead of as a design advantage often leads to worse data-pipeline behavior.

Summary

  • TFRecord workflows are safest when files are treated as immutable record streams.
  • The usual “append” strategy is to write a new shard and read multiple shards together.
  • If one final file is required, rebuild into a new file rather than mutating the old one in place.
  • Keep the feature schema consistent across old and new records.
  • Prefer sharding for incremental dataset growth and operational simplicity.
