Create a custom federated data set in TensorFlow Federated
Introduction
In TensorFlow Federated, a federated dataset is not just one large tf.data.Dataset. It is a client-partitioned data source where each client has its own local dataset. The practical question is how to map client identifiers to TensorFlow datasets in a form that TFF can consume during simulation and training.
The Core Idea: Client-Partitioned Data
TFF simulation APIs revolve around the idea that each client owns its own examples. That is why the key abstraction is a client-data object that can do two things:
- list client ids
- create a tf.data.Dataset for one client
This matches the federated-learning assumption that data is distributed rather than centrally shuffled into one global dataset.
A Small In-Memory Example
For toy experiments, a small in-memory structure is enough to model client-partitioned data.
Holding everything in memory is convenient for experiments and tests, but it is not the right tool for large, realistic datasets, because every client's examples must fit in memory at once.
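As a rough illustration of the two-method contract described earlier, here is a minimal, library-free sketch. The class name InMemoryClientData, the method name create_dataset_for_client, and the toy values are all assumptions made for this example, not TFF's API. Plain Python lists stand in for tf.data.Dataset so the snippet runs without TensorFlow; in real TFF code the per-client method would return a tf.data.Dataset instead.

```python
from typing import Dict, List

# One example is a flat dict of named features (hypothetical toy schema).
Example = Dict[str, float]


class InMemoryClientData:
    """Toy client-partitioned data source held entirely in memory."""

    def __init__(self, data: Dict[str, List[Example]]) -> None:
        self._data = data

    @property
    def client_ids(self) -> List[str]:
        # Sorted so the client order is stable between runs.
        return sorted(self._data)

    def create_dataset_for_client(self, client_id: str) -> List[Example]:
        # In real TFF code this would return a tf.data.Dataset built from
        # the client's tensors; a copy of the list stands in for it here.
        return list(self._data[client_id])


# Two clients with different amounts of local data.
toy_data = InMemoryClientData({
    "client_b": [{"x": 3.0, "y": 1.0}],
    "client_a": [{"x": 1.0, "y": 0.0}, {"x": 2.0, "y": 1.0}],
})
```

The point of the sketch is the shape of the interface: callers never see one global dataset, only a list of ids and a way to ask for one client's examples.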
Preprocess Each Client Dataset
Once you have client-partitioned data, you usually preprocess each client's dataset into the batched structure expected by the learning process.
The important rule is structural consistency. After preprocessing, every client should yield the same keys, dtypes, and tensor ranks.
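One way to catch structural drift early is to compare the first element of every client's preprocessed dataset before training starts. The sketch below is an assumption-laden stand-in: it works on plain Python dicts and nested lists, and the helper names element_signature and find_inconsistent_clients are invented for this example. With real tf.data pipelines, comparing each dataset's element_spec would serve the same purpose.

```python
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple

Signature = Tuple[Tuple[str, str, int], ...]


def element_signature(example: Dict[str, Any]) -> Signature:
    """Summarize one element as (key, leaf type name, nesting depth) triples."""
    signature = []
    for key in sorted(example):
        value, depth = example[key], 0
        # Walk into nested lists to find the leaf type and the rank.
        # An empty list is treated as a leaf of type "list".
        while isinstance(value, list) and value:
            value, depth = value[0], depth + 1
        signature.append((key, type(value).__name__, depth))
    return tuple(signature)


def find_inconsistent_clients(
    client_ids: Sequence[str],
    dataset_fn: Callable[[str], Iterable[Dict[str, Any]]],
) -> List[str]:
    """Return clients whose first element differs from the first client's.

    Assumes every client dataset yields at least one element.
    """
    reference = None
    mismatched = []
    for client_id in client_ids:
        first = next(iter(dataset_fn(client_id)))
        signature = element_signature(first)
        if reference is None:
            reference = signature
        elif signature != reference:
            mismatched.append(client_id)
    return mismatched
```

Only the first element of each client is inspected, so this is a cheap sanity check rather than a full validation of the pipeline.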
File-Backed or Generated Client Datasets
For more realistic simulations, the usual pattern is to define a function that maps a client id to a TensorFlow dataset.
This approach scales better because client datasets are created on demand instead of being materialized all at once in one large Python structure.
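A minimal sketch of such a function, under stated assumptions: the name make_client_dataset, the synthetic features, and the default size are hypothetical, and a Python generator stands in for the tf.data.Dataset that a real TFF dataset function would return (for example, one that reads a per-client file). What the sketch does show is the on-demand pattern: nothing is produced until a client is actually requested, and the output depends only on the client id.

```python
import random
from typing import Dict, Iterator


def make_client_dataset(
    client_id: str, num_examples: int = 4
) -> Iterator[Dict[str, float]]:
    """Lazily generate one client's examples; nothing is built until iterated."""
    # Seeding from the client id alone makes the output repeatable and keeps
    # the function free of captured outside state.
    rng = random.Random(client_id)
    for _ in range(num_examples):
        x = rng.uniform(-1.0, 1.0)
        # Every client yields the same keys and value types.
        yield {"x": x, "y": float(x > 0.0)}
```

Calling the function twice with the same id yields the same examples, which makes simulation runs easier to reproduce and compare.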
Build Federated Training Input
During simulation, a federated training round usually receives a list of client datasets for the selected clients.
That list becomes the client-side input to one round of federated training.
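The selection step itself is ordinary Python and can be sketched without any TFF calls. In the example below, build_round_input is a hypothetical helper, and the dataset function is injected so that any client-id-to-dataset mapping can be plugged in; the usage lines pass a trivial stand-in that returns a plain list rather than a tf.data.Dataset.

```python
import random
from typing import Callable, List, Sequence, TypeVar

D = TypeVar("D")


def build_round_input(
    client_ids: Sequence[str],
    dataset_fn: Callable[[str], D],
    clients_per_round: int,
    rng: random.Random,
) -> List[D]:
    """Sample clients without replacement and build one dataset per client."""
    sampled = rng.sample(list(client_ids), clients_per_round)
    return [dataset_fn(client_id) for client_id in sampled]


# Usage with a stand-in dataset function (hypothetical; returns a plain list).
all_clients = [f"client_{i}" for i in range(10)]
round_input = build_round_input(
    all_clients, lambda cid: [cid], clients_per_round=3, rng=random.Random(0)
)
```

Passing an explicitly seeded random.Random keeps client selection reproducible across runs without touching the global random state.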
Keep the Dataset Function Simple and Serializable
When using a dataset function, keep it simple. Capturing arbitrary Python state or returning inconsistent structures across clients can make TFF computations fail in ways that are hard to debug.
The safest pattern is:
- one clear client id
- one deterministic dataset builder
- one consistent preprocessing pipeline
That keeps the federated type signature stable across all clients.
Common Pitfalls
Trying to create a federated dataset without a clear client-to-dataset mapping misses the central idea of federated learning.
Using in-memory toy helpers for large realistic datasets makes experiments harder to scale and maintain.
Letting different clients produce different post-preprocessing structures will usually break TFF type expectations.
Capturing complex Python state inside the dataset function can make serialization or repeated simulation behavior unreliable.
Treating federated data as though it were one central shuffled dataset leads to the wrong mental model and the wrong code structure.
Summary
- A TFF federated dataset is client-partitioned, not one global dataset.
- The core abstraction maps client ids to tf.data.Dataset objects.
- Small in-memory examples are fine for experiments, but larger simulations need on-demand client dataset creation.
- Preprocess every client into one consistent batched structure.
- Keep dataset creation deterministic and structurally aligned across clients.

