How to Decode a JSON String from a String Tensor?

Introduction

Sometimes JSON records reach TensorFlow as tf.string tensors, especially in ingestion pipelines that carry raw text between systems. Decoding them is possible, but the best solution depends on whether you need arbitrary JSON parsing inside the TensorFlow pipeline or whether you can parse upstream and feed structured tensors instead.

Understand the Main Limitation

TensorFlow does not provide a native high-performance graph op for arbitrary free-form JSON parsing in the same way it does for serialized Example records. The usual workaround is tf.py_function, which lets Python parse one string element at a time.

That works, but it has tradeoffs:

  • it executes Python outside normal graph optimization
  • it can reduce throughput in large pipelines
  • it requires you to set output dtypes and shapes explicitly

So the real design question is not only whether TensorFlow can decode the JSON, but whether the parsing belongs inside TensorFlow at all.

Parse JSON With tf.py_function

For flexible parsing of arbitrary JSON objects, tf.py_function is usually the practical tool.

python
import json
import tensorflow as tf

raw = tf.constant([
    b'{"id": 1, "name": "alice", "score": 9.2}',
    b'{"id": 2, "name": "bob", "score": 7.8}',
])


def parse_one_json(x: tf.Tensor):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    obj = json.loads(x.numpy().decode("utf-8"))
    return obj["id"], obj["name"].encode("utf-8"), obj["score"]


def tf_parse_json(x: tf.Tensor):
    # Tout declares the dtype of each returned value, in order.
    id_t, name_t, score_t = tf.py_function(
        func=parse_one_json,
        inp=[x],
        Tout=[tf.int32, tf.string, tf.float32],
    )

    # tf.py_function outputs have no static shape; declare each one as a scalar.
    id_t.set_shape([])
    name_t.set_shape([])
    score_t.set_shape([])

    return {"id": id_t, "name": name_t, "score": score_t}


ds = tf.data.Dataset.from_tensor_slices(raw).map(tf_parse_json)
for item in ds:
    print(item)

The important step is setting shapes after tf.py_function. Its outputs carry no static shape information, and downstream layers and dataset transformations often depend on known tensor shapes.
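One way to see the effect of set_shape is to compare the dataset's element_spec with and without it. The sketch below is illustrative only: parse_id, without_shape, and with_shape are hypothetical helper names, and the input is reduced to a single "id" field.

```python
import json
import tensorflow as tf

raw = tf.constant([b'{"id": 1}', b'{"id": 2}'])


def parse_id(x):
    # Runs eagerly in Python: x holds one bytes value.
    return json.loads(x.numpy().decode("utf-8"))["id"]


def without_shape(x):
    # No shape is declared, so the output has unknown rank.
    return tf.py_function(func=parse_id, inp=[x], Tout=tf.int32)


def with_shape(x):
    id_t = tf.py_function(func=parse_id, inp=[x], Tout=tf.int32)
    id_t.set_shape([])  # declare the output as a scalar
    return id_t


ds_unknown = tf.data.Dataset.from_tensor_slices(raw).map(without_shape)
ds_known = tf.data.Dataset.from_tensor_slices(raw).map(with_shape)

# Compare the static shape recorded for each dataset element.
print(ds_unknown.element_spec)
print(ds_known.element_spec)
```

The first spec reports an unknown shape and the second reports a scalar, which is what later batching or model input code can rely on.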

Handle Bad JSON Deliberately

Real data is rarely perfect. Malformed records or missing keys should be handled intentionally instead of crashing the whole pipeline.

python
import json
import tensorflow as tf


def parse_with_defaults(x: tf.Tensor):
    try:
        obj = json.loads(x.numpy().decode("utf-8"))
        # Missing keys fall back to sentinel defaults.
        return (
            int(obj.get("id", -1)),
            str(obj.get("name", "unknown")).encode("utf-8"),
            float(obj.get("score", 0.0)),
        )
    except Exception:
        # Malformed JSON, non-object payloads, or uncastable values land here.
        return -1, b"invalid", 0.0


def tf_parse_safe(x: tf.Tensor):
    outputs = tf.py_function(
        func=parse_with_defaults,
        inp=[x],
        Tout=[tf.int32, tf.string, tf.float32],
    )
    # Declare every output as a scalar so downstream ops see known shapes.
    for tensor in outputs:
        tensor.set_shape([])
    return {"id": outputs[0], "name": outputs[1], "score": outputs[2]}

From there, you can filter invalid records or route them to monitoring.
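A minimal sketch of that filtering step follows, assuming the sentinel name b"invalid" marks a failed parse as in the function above. The sample input and the valid/invalid dataset names are made up for illustration, and the parsing helpers are repeated so the block runs on its own.

```python
import json
import tensorflow as tf

raw = tf.constant([
    b'{"id": 1, "name": "alice", "score": 9.2}',
    b'not valid json',
    b'{"id": 2, "name": "bob", "score": 7.8}',
])


def parse_with_defaults(x):
    try:
        obj = json.loads(x.numpy().decode("utf-8"))
        return (
            int(obj.get("id", -1)),
            str(obj.get("name", "unknown")).encode("utf-8"),
            float(obj.get("score", 0.0)),
        )
    except Exception:
        return -1, b"invalid", 0.0


def tf_parse_safe(x):
    outputs = tf.py_function(
        func=parse_with_defaults,
        inp=[x],
        Tout=[tf.int32, tf.string, tf.float32],
    )
    for tensor in outputs:
        tensor.set_shape([])
    return {"id": outputs[0], "name": outputs[1], "score": outputs[2]}


parsed = tf.data.Dataset.from_tensor_slices(raw).map(tf_parse_safe)

# Split on the sentinel name. A record that merely lacks "id" also gets
# id == -1, so the id alone would be an ambiguous marker.
valid = parsed.filter(lambda rec: tf.not_equal(rec["name"], "invalid"))
invalid = parsed.filter(lambda rec: tf.equal(rec["name"], "invalid"))

valid_ids = [int(rec["id"]) for rec in valid.as_numpy_iterator()]
num_invalid = sum(1 for _ in invalid)  # could be reported to a metrics system
print(valid_ids, num_invalid)
```

A sentinel string is simple but can collide with real data (a user actually named "invalid"). Returning an extra boolean "ok" output from the parser is a cleaner alternative if that matters.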

Parse Upstream When Performance Matters

If throughput matters, it is often better to decode JSON before TensorFlow sees it. Then the TensorFlow pipeline works with already-typed values.

python
import tensorflow as tf

# Records already decoded from JSON before TensorFlow is involved.
records = [
    {"id": 1, "name": "alice", "score": 9.2},
    {"id": 2, "name": "bob", "score": 7.8},
]

# Build one typed tensor per feature column.
ids = tf.constant([record["id"] for record in records], dtype=tf.int32)
names = tf.constant([record["name"] for record in records], dtype=tf.string)
scores = tf.constant([record["score"] for record in records], dtype=tf.float32)

ds = tf.data.Dataset.from_tensor_slices({"id": ids, "name": names, "score": scores})
for item in ds:
    print(item)

This keeps the pipeline easier to optimize and easier to reason about.
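The example above starts from records that are already Python dictionaries. The upstream decoding step itself might look like the following standard-library sketch, which assumes one JSON object per line. The sample lines, the parse_lines helper, and the choice of defaults are all hypothetical.

```python
import json

# Hypothetical raw input: one JSON object per line.
raw_lines = [
    '{"id": 1, "name": "alice", "score": 9.2}',
    '{"id": 2, "name": "bob"}',
    'not valid json',
]


def parse_lines(lines):
    """Return (columns, rejected): typed feature columns and the lines that failed."""
    columns = {"id": [], "name": [], "score": []}
    rejected = []
    for line in lines:
        try:
            obj = json.loads(line)
            row_id = int(obj["id"])                  # required field
            name = str(obj.get("name", "unknown"))   # optional, with default
            score = float(obj.get("score", 0.0))     # optional, with default
        except (ValueError, TypeError, KeyError, AttributeError):
            # Malformed JSON, a non-object payload, or a missing/uncastable id.
            rejected.append(line)
            continue
        columns["id"].append(row_id)
        columns["name"].append(name)
        columns["score"].append(score)
    return columns, rejected


columns, rejected = parse_lines(raw_lines)
print(columns)
print(rejected)
```

Each list in columns can then be passed to tf.constant with an explicit dtype, and the rejected lines can be logged or counted without involving TensorFlow.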

Know What tf.io.decode_json_example Is For

TensorFlow does provide tf.io.decode_json_example, but it is intended for JSON representations of Example protobuf records, not arbitrary API-style JSON objects.

That means it is the right tool only when your input format already follows the Example schema. For general JSON payloads, tf.py_function or upstream parsing is still the relevant approach.
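To make the difference concrete, here is a hedged sketch of the format tf.io.decode_json_example expects. It builds an Example, serializes it to JSON with json_format.MessageToJson from the protobuf package (which is installed alongside TensorFlow), and then decodes and parses it. The feature names "id" and "score" are illustrative.

```python
import tensorflow as tf
from google.protobuf import json_format

# Build an Example proto and render it as JSON to show the expected layout.
example = tf.train.Example(features=tf.train.Features(feature={
    "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    "score": tf.train.Feature(float_list=tf.train.FloatList(value=[9.2])),
}))
example_json = json_format.MessageToJson(example)
print(example_json)  # nested "features" / "feature" structure, not a flat object

# decode_json_example turns Example JSON into the binary serialized form.
serialized = tf.io.decode_json_example(example_json)

# The binary form is then parsed like any other serialized Example.
parsed = tf.io.parse_single_example(serialized, {
    "id": tf.io.FixedLenFeature([], tf.int64),
    "score": tf.io.FixedLenFeature([], tf.float32),
})
print(parsed)
```

The printed JSON shows why a flat payload such as {"id": 1, "score": 9.2} does not fit this op: the values must sit inside typed lists under the Example feature map.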

Common Pitfalls

The biggest mistake is assuming TensorFlow has a native, high-performance parser for arbitrary JSON objects. In most cases, it does not.

Another common issue is forgetting to set output shapes after tf.py_function. Developers also sometimes push highly dynamic JSON schemas into a training pipeline that really wants fixed typed features, which makes the input layer harder to maintain than it needs to be.

Summary

  • Arbitrary JSON in a tf.string tensor is usually parsed with tf.py_function.
  • Declare output dtypes through Tout and set output shapes explicitly after the tf.py_function call.
  • Handle malformed records intentionally instead of letting one bad line crash the pipeline.
  • Parse upstream when performance and schema stability matter.
  • Use tf.io.decode_json_example only for Example JSON, not generic JSON payloads.
