How to Append Data to a TensorFlow TFRecord File

Introduction

TFRecord files are sequential binary record streams, and the TensorFlow Python APIs are designed primarily for writing new files, not for safe in-place append workflows. In practice, the usual solutions are either to write a new shard for new data, or to rebuild the dataset into a new TFRecord file.

Understand the Practical Limitation

A TFRecord is not a database table with update and append semantics. It is a record stream. That means the safest pattern is usually:

write data once
treat the file as immutable
create additional shards instead of reopening the same file for ad hoc appends
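
To make the "record stream" idea concrete, here is a minimal sketch that counts the records in a file by walking its length headers. It assumes an uncompressed file and the framing described in the TensorFlow TFRecord documentation (an 8-byte little-endian length, a 4-byte CRC of the length, the record bytes, and a 4-byte CRC of the data). The function name `count_records` is hypothetical, and the sketch skips CRC verification, so treat it as an illustration rather than a replacement for `tf.data.TFRecordDataset`.

```python
import struct


def count_records(path: str) -> int:
    """Count records in an uncompressed TFRecord file by walking its headers."""
    count = 0
    with open(path, "rb") as f:
        while True:
            # 8-byte little-endian length followed by a 4-byte CRC of that length.
            header = f.read(12)
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            # Record bytes followed by a 4-byte CRC of the data (not verified here).
            body = f.read(length + 4)
            if len(body) < length + 4:
                raise ValueError("truncated record body")
            count += 1
```

Because there is no index or footer, the only way to find the Nth record is to read past the records before it, which is one reason the format suits write-once files better than files that are edited in place.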

Creating a record looks like this:

python

import tensorflow as tf


def make_example(text: str, label: int) -> bytes:
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[text.encode()])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }
        )
    )
    return example.SerializeToString()

Writing a file is straightforward:

python

with tf.io.TFRecordWriter("data-0001.tfrecord") as writer:
    writer.write(make_example("alpha", 1))
    writer.write(make_example("beta", 0))

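The difficulty begins when you want to “append” later: reopening the same path with tf.io.TFRecordWriter starts a fresh file and replaces the existing contents rather than extending them.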

Prefer Writing Another Shard

For most pipelines, the best append strategy is not to modify the old file at all. Write the new records into a new shard and read both shards as one dataset later.

python

with tf.io.TFRecordWriter("data-0002.tfrecord") as writer:
    writer.write(make_example("gamma", 1))
    writer.write(make_example("delta", 0))

Then load all shards together:

python

dataset = tf.data.TFRecordDataset([
    "data-0001.tfrecord",
    "data-0002.tfrecord",
])

for record in dataset.take(2):
    print(len(record.numpy()))

This is usually the most robust design because it keeps old data immutable and makes incremental dataset growth easy to reason about.
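
As the number of shards grows, hard-coding file names becomes error-prone. The following is a minimal sketch, using only the Python standard library, of one way to list existing shards in order and pick the next unused shard name. The helper names `list_shards` and `next_shard_path` are hypothetical, and the `data-NNNN.tfrecord` pattern is an assumption based on the file names used in this article, not a TensorFlow convention.

```python
import re
from pathlib import Path
from typing import List

# Assumed naming convention, matching the examples in this article: data-0001.tfrecord
SHARD_PATTERN = re.compile(r"^data-(\d{4})\.tfrecord$")


def shard_number(path: str) -> int:
    """Extract the numeric part of a shard file name."""
    return int(SHARD_PATTERN.match(Path(path).name).group(1))


def list_shards(directory: str) -> List[str]:
    """Return existing shard paths in the directory, sorted by shard number."""
    shards = [
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and SHARD_PATTERN.match(p.name)
    ]
    return sorted(shards, key=shard_number)


def next_shard_path(directory: str) -> str:
    """Return a path one past the highest existing shard number."""
    shards = list_shards(directory)
    last = shard_number(shards[-1]) if shards else 0
    return str(Path(directory) / f"data-{last + 1:04d}.tfrecord")
```

The string returned by `next_shard_path` can be passed to `tf.io.TFRecordWriter`, and the list returned by `list_shards` can be passed to `tf.data.TFRecordDataset`. This sketch only handles local directories and does not guard against two writers choosing the same name at once.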

Rebuild When One Output File Is Required

If you absolutely need one final TFRecord file, read the existing file, write its contents into a new output file, then write the new records after it.

python

import tensorflow as tf

source_file = "data-0001.tfrecord"
merged_file = "data-merged.tfrecord"

# make_example is the helper defined earlier in this article.
new_records = [
    make_example("gamma", 1),
    make_example("delta", 0),
]

with tf.io.TFRecordWriter(merged_file) as writer:
    # Copy the existing records byte-for-byte, in their original order.
    for raw in tf.data.TFRecordDataset([source_file]):
        writer.write(raw.numpy())

    # Then write the new records after them.
    for record in new_records:
        writer.write(record)

This is not a true in-place append. It is a controlled rewrite, which is much safer.
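
If the rebuilt file has to end up under a name that readers already use, one common precaution is to write to a temporary file and rename it into place only after the write succeeds. The sketch below shows that pattern with the standard library alone. The function name `rebuild_atomically` and its `write_all` callback are hypothetical, and the approach assumes a local file system where `os.replace` can rename within one directory.

```python
import os
import tempfile
from typing import Callable


def rebuild_atomically(target_path: str, write_all: Callable[[str], None]) -> None:
    """Write a complete new file via write_all, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temporary file next to the target so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_all(tmp_path)  # the caller writes the old records, then the new ones
        os.replace(tmp_path, target_path)  # replaces the target in a single rename
    except BaseException:
        # Leave the original target untouched and remove the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Here `write_all` would be a function that opens `tf.io.TFRecordWriter` on the path it receives and runs the two write loops from the rebuild example. If it raises partway through, the existing file is left as it was.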

Keep the Schema Consistent

Conceptually, appending only works if every serialized example follows the same feature schema. If the new records add or remove features unexpectedly, downstream parsing may start failing.

A consistent parse function helps enforce that:

python

feature_spec = {
    "text": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def parse_record(raw_record):
    return tf.io.parse_single_example(raw_record, feature_spec)

If older and newer records differ structurally, you should treat that as a format migration problem, not as a simple append.
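
Schema drift is cheaper to catch before a record is serialized than after it has been written to a shard. The following is a minimal sketch of such a check in plain Python. The names `EXPECTED_FIELDS` and `validate_row` are hypothetical, and the expected layout is an assumption that mirrors the "text" and "label" features used in this article.

```python
# Assumed expected layout, mirroring the "text" and "label" features in this article.
EXPECTED_FIELDS = {"text": str, "label": int}


def validate_row(row: dict) -> None:
    """Raise ValueError if a row does not match EXPECTED_FIELDS exactly."""
    missing = EXPECTED_FIELDS.keys() - row.keys()
    extra = row.keys() - EXPECTED_FIELDS.keys()
    if missing or extra:
        raise ValueError(
            f"field mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    for name, expected_type in EXPECTED_FIELDS.items():
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(
                f"{name!r} should be {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
```

A writer loop could call `validate_row(row)` and only then pass `row["text"]` and `row["label"]` to `make_example`, so that a malformed row stops the job instead of ending up in a shard.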

Sharding Is Usually Better Than Mutation

Operationally, sharding has major advantages:

safer incremental writes
easier parallel reads
clearer failure recovery
less risk of corrupting one monolithic file

That is why many TensorFlow pipelines naturally produce many TFRecord files instead of one growing file.

Common Pitfalls

Expecting TFRecord to support database-like in-place append behavior leads to brittle workflows.
Decoding and re-encoding records during a rewrite, instead of copying the serialized bytes unchanged, can accidentally alter the schema or the record order.
Appending records with a different feature layout can break downstream parsing.
Keeping one giant TFRecord file instead of sharding makes maintenance and recovery harder.
Treating immutability as a limitation instead of as a design advantage tends to produce more fragile data pipelines.

Summary

TFRecord workflows are safest when files are treated as immutable record streams.
The usual “append” strategy is to write a new shard and read multiple shards together.
If one final file is required, rebuild into a new file rather than mutating the old one in place.
Keep the feature schema consistent across old and new records.
Prefer sharding for incremental dataset growth and operational simplicity.