Create a custom federated data set in TensorFlow Federated

Introduction

In TensorFlow Federated, a federated dataset is not just one large tf.data.Dataset. It is a client-partitioned data source where each client has its own local dataset. The practical question is how to map client identifiers to TensorFlow datasets in a form that TFF can consume during simulation and training.

The Core Idea: Client-Partitioned Data

TFF simulation APIs revolve around the idea that each client owns its own examples. That is why the key abstraction is a client-data object that can do two things:

  • list client ids
  • create a tf.data.Dataset for one client

This matches the federated-learning assumption that data is distributed rather than centrally shuffled into one global dataset.

A Small In-Memory Example

For toy experiments, a small in-memory structure is enough to model client-partitioned data.

python
import collections
import numpy as np
import tensorflow_federated as tff

# Each key is a client id; each value holds that client's local examples.
client_tensors = collections.OrderedDict({
    "client_1": {
        "x": np.array([[1.0], [2.0]], dtype=np.float32),
        "y": np.array([0, 1], dtype=np.int32),
    },
    "client_2": {
        "x": np.array([[3.0], [4.0], [5.0]], dtype=np.float32),
        "y": np.array([1, 0, 1], dtype=np.int32),
    },
})

# Wrap the in-memory mapping in a client-data object.
client_data = tff.simulation.datasets.TestClientData(client_tensors)

# List the client ids, then inspect the structure of one client's dataset.
print(client_data.client_ids)
print(client_data.create_tf_dataset_for_client("client_1").element_spec)

This is convenient for experiments and tests, but it is not the right tool for large realistic datasets because everything lives in memory.
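Raw data often arrives as one flat table with a client-id column rather than as a ready-made per-client mapping. The following is a minimal sketch of grouping such records into the structure used above. The field names `client_id`, `x`, and `y`, the helper `group_by_client`, and the sample records are all hypothetical.

```python
import collections

import numpy as np

# Hypothetical flat records; in practice these might come from a file or database.
records = [
    {"client_id": "client_1", "x": [1.0], "y": 0},
    {"client_id": "client_2", "x": [3.0], "y": 1},
    {"client_id": "client_1", "x": [2.0], "y": 1},
    {"client_id": "client_2", "x": [4.0], "y": 0},
    {"client_id": "client_2", "x": [5.0], "y": 1},
]


def group_by_client(records):
    """Groups flat records into {client_id: {"x": array, "y": array}}."""
    grouped = collections.defaultdict(lambda: {"x": [], "y": []})
    for record in records:
        grouped[record["client_id"]]["x"].append(record["x"])
        grouped[record["client_id"]]["y"].append(record["y"])

    # Sort ids for a stable order and fix dtypes explicitly so that every
    # client ends up with the same structure.
    return collections.OrderedDict(
        (
            client_id,
            {
                "x": np.asarray(grouped[client_id]["x"], dtype=np.float32),
                "y": np.asarray(grouped[client_id]["y"], dtype=np.int32),
            },
        )
        for client_id in sorted(grouped)
    )


client_tensors = group_by_client(records)
```

The result has the same layout as the hand-written `client_tensors` mapping, so it could be passed to the same constructor.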

Preprocess Each Client Dataset

Once you have client-partitioned data, you usually preprocess each client's dataset into the batched structure expected by the learning process.

python
import collections
import tensorflow as tf


def preprocess(dataset):
    # Runs on batched elements, so "x" and "y" carry a leading batch dimension.
    def map_fn(example):
        return collections.OrderedDict(
            x=tf.cast(example["x"], tf.float32),
            y=tf.cast(example["y"], tf.int32),
        )

    # Shuffle within the client, batch, cast to fixed dtypes, then prefetch.
    return dataset.shuffle(10).batch(2).map(map_fn).prefetch(tf.data.AUTOTUNE)


# Apply the same pipeline to every client's dataset.
preprocessed_client_data = client_data.preprocess(preprocess)

The important rule is structural consistency. After preprocessing, every client's dataset should have the same element_spec: the same keys, the same dtypes, and the same tensor ranks.
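One way to catch inconsistencies early is to compare each client's raw arrays before any pipeline is built. The sketch below is a plain NumPy pre-flight check, not a TFF API; the helper names and the sample data are hypothetical. For datasets that are already built, the analogous check is to compare each dataset's `element_spec`.

```python
import numpy as np


def structure_signature(arrays):
    """Returns (key, dtype name, per-example shape) per feature, sorted by key."""
    return tuple(
        (key, arrays[key].dtype.name, arrays[key].shape[1:])
        for key in sorted(arrays)
    )


def find_mismatched_clients(client_arrays):
    """Returns ids of clients whose structure differs from the first client's."""
    client_ids = list(client_arrays)
    reference = structure_signature(client_arrays[client_ids[0]])
    return [
        client_id
        for client_id in client_ids
        if structure_signature(client_arrays[client_id]) != reference
    ]


example_clients = {
    "client_1": {"x": np.zeros((2, 1), np.float32), "y": np.zeros(2, np.int32)},
    "client_2": {"x": np.zeros((3, 1), np.float32), "y": np.zeros(3, np.int32)},
    # Deliberately inconsistent: x is rank 1 instead of rank 2.
    "client_3": {"x": np.zeros(4, np.float32), "y": np.zeros(4, np.int32)},
}

mismatched = find_mismatched_clients(example_clients)
```

The number of examples per client is ignored on purpose: clients may hold different amounts of data, but each example should look the same.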

File-Backed or Generated Client Datasets

For more realistic simulations, the usual pattern is to define a function that maps a client id to a TensorFlow dataset.

python
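import tensorflow as tf
import tensorflow_federated as tff

client_ids = ["client_1", "client_2"]


def make_dataset_for_client(client_id):
    # Caution: the dataset function is expected to be usable inside a
    # tf.function, where client_id is a string tensor. The Python `if` below
    # only works while client_id is a plain Python string; real code should
    # derive the data from client_id with TensorFlow ops, for example by
    # building a per-client file path.
    if client_id == "client_1":
        return tf.data.Dataset.from_tensor_slices({
            "x": [1.0, 2.0],
            "y": [0, 1],
        })
    return tf.data.Dataset.from_tensor_slices({
        "x": [3.0, 4.0, 5.0],
        "y": [1, 0, 1],
    })


# Build a client-data object from the id list and the dataset function.
client_data = tff.simulation.datasets.ClientData.from_clients_and_tf_fn(
    client_ids=client_ids,
    serializable_dataset_fn=make_dataset_for_client,
)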

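This approach scales better because each client dataset is created on demand instead of all of them being materialized at once in one large Python structure. In a realistic simulation, the function body would read or generate that client's data rather than branch on hard-coded ids.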

Build Federated Training Input

During simulation, a federated training round usually receives a list of client datasets for the selected clients.

python
sampled_client_ids = preprocessed_client_data.client_ids[:2]

federated_train_data = [
    preprocessed_client_data.create_tf_dataset_for_client(client_id)
    for client_id in sampled_client_ids
]

That list becomes the client-side input to one round of federated training.
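The snippet above always takes the first two client ids. In longer simulations it is common to draw a different subset of clients each round. The following is a minimal sketch of reproducible per-round sampling in plain Python; the helper `sample_clients` and its parameters are hypothetical and not part of TFF.

```python
import random


def sample_clients(client_ids, num_clients, round_num, seed=0):
    """Picks a reproducible random subset of client ids for one round."""
    # Seed from both the base seed and the round number so each round gets a
    # different sample that is still repeatable across runs.
    rng = random.Random(f"{seed}:{round_num}")
    pool = list(client_ids)
    return rng.sample(pool, k=min(num_clients, len(pool)))


all_client_ids = [f"client_{i}" for i in range(1, 11)]

for round_num in range(3):
    sampled = sample_clients(all_client_ids, num_clients=3, round_num=round_num)
    print(round_num, sampled)
```

The sampled ids can then be turned into a list of client datasets in the same way as `sampled_client_ids` above.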

Keep the Dataset Function Simple and Serializable

When using a dataset function, keep it simple. Capturing arbitrary Python state or returning inconsistent structures across clients can make TFF computations fail in ways that are hard to debug.

The safest pattern is:

  • one client id as the only input
  • one deterministic dataset builder
  • one consistent preprocessing pipeline

That keeps the federated type signature stable across all clients.
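For generated (synthetic) client data, determinism mostly comes down to how the random seed is derived from the client id. The sketch below shows one possible approach in plain Python, assuming a hypothetical generator that produces toy `x`/`y` examples. It prepares the data that a dataset builder would consume and is not itself a TFF dataset function.

```python
import hashlib
import random


def stable_seed(client_id):
    """Derives an integer seed from a client id that is stable across processes."""
    # Python's built-in hash() is salted per process for strings by default,
    # so a cryptographic digest is used to get a repeatable value.
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def generate_client_examples(client_id, num_examples=4):
    """Builds synthetic {"x": [[float], ...], "y": [int, ...]} data for one client."""
    rng = random.Random(stable_seed(client_id))
    xs = [[rng.uniform(0.0, 5.0)] for _ in range(num_examples)]
    # Toy labelling rule: label is 1 when the feature exceeds 2.5.
    ys = [int(x[0] > 2.5) for x in xs]
    return {"x": xs, "y": ys}


first = generate_client_examples("client_1")
second = generate_client_examples("client_1")
```

Calling the generator twice with the same id yields identical examples, so repeated simulation runs see the same client data.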

Common Pitfalls

Trying to create a federated dataset without a clear client-to-dataset mapping misses the central idea of federated learning.

Using in-memory toy helpers for large realistic datasets makes experiments harder to scale and maintain.

Letting different clients produce different post-preprocessing structures will usually break TFF type expectations.

Capturing complex Python state inside the dataset function can make serialization or repeated simulation behavior unreliable.

Treating federated data as though it were one central shuffled dataset leads to the wrong mental model and the wrong code structure.

Summary

  • A TFF federated dataset is client-partitioned, not one global dataset.
  • The core abstraction maps client ids to tf.data.Dataset objects.
  • Small in-memory examples are fine for experiments, but larger simulations need on-demand client dataset creation.
  • Preprocess every client into one consistent batched structure.
  • Keep dataset creation deterministic and structurally aligned across clients.
