CSV parsing
tree structure
data visualization
data structure
hierarchical representation

Generate tree structure from csv

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Turning CSV data into a tree structure is a common step when flat records actually describe parent and child relationships. The main work is not tree traversal, but correctly interpreting the CSV schema and linking nodes in a stable way. A good implementation validates parent references, handles missing roots, and keeps construction separate from display logic.

Start with a Clear CSV Model

The simplest tree-oriented CSV has one row per node and at least these columns:

  • node id
  • parent id
  • label or payload

Example:

csv
1id,parent_id,name
21,,Root
32,1,Child A
43,1,Child B
54,2,Grandchild

This format is much easier to convert than a CSV that embeds path strings or indentation levels.

A reliable two-pass approach is:

  1. create all nodes by id
  2. connect each node to its parent
python
1import csv
2from dataclasses import dataclass, field
3
4
5@dataclass
6class Node:
7    node_id: str
8    name: str
9    children: list = field(default_factory=list)
10
11
12def build_tree(path: str):
13    nodes = {}
14    roots = []
15    rows = []
16
17    with open(path, newline="", encoding="utf-8") as f:
18        reader = csv.DictReader(f)
19        for row in reader:
20            rows.append(row)
21            nodes[row["id"]] = Node(node_id=row["id"], name=row["name"])
22
23    for row in rows:
24        node = nodes[row["id"]]
25        parent_id = row["parent_id"]
26
27        if parent_id:
28            nodes[parent_id].children.append(node)
29        else:
30            roots.append(node)
31
32    return roots

This avoids ordering bugs where a child appears before its parent in the file.

Once built, a simple recursive printer makes it easy to verify the structure.

python
1def print_tree(node, indent=0):
2    print("  " * indent + f"- {node.name} ({node.node_id})")
3    for child in node.children:
4        print_tree(child, indent + 1)

Usage:

python
roots = build_tree("tree.csv")
for root in roots:
    print_tree(root)

Separating construction from rendering keeps the code easier to test.

Validate Missing Parents

Real CSV data is often messy. A row may reference a parent id that does not exist.

Modify the linking step defensively:

python
1if parent_id:
2    if parent_id not in nodes:
3        raise ValueError(f"Missing parent id: {parent_id}")
4    nodes[parent_id].children.append(node)
5else:
6    roots.append(node)

Failing early is usually better than silently creating broken trees.

Handle Multiple Roots and Forests

Some CSVs describe a forest rather than a single tree. That is why returning a list of roots is often safer than forcing one artificial root.

If your application needs exactly one root, validate that constraint explicitly instead of assuming it from the data.

Alternative Input Style: Full Paths

Sometimes CSV does not contain parent ids, but full hierarchical paths such as:

  • 'root/a'
  • 'root/b'
  • 'root/a/c'

In that case, build nodes by splitting the path and inserting missing intermediate nodes. That is a different parsing problem, but the resulting tree structure can still use the same node model.

The important part is to design the importer around the real CSV semantics instead of forcing every dataset into one shape.

Detect Cycles and Bad Data

If the CSV is user-generated or imported from other systems, cycles are possible. A cycle makes it impossible to build a valid tree.

At minimum, add checks for:

  • missing parents
  • duplicate ids
  • self-parenting rows

These checks turn subtle recursion bugs into immediate data-quality failures.

Common Pitfalls

  • Linking rows in one pass before all nodes exist.
  • Assuming the CSV always contains exactly one root.
  • Ignoring missing parent ids and silently building broken structures.
  • Mixing tree-building code with printing or UI code.
  • Forgetting to validate duplicate ids or cycle-like relationships.

Summary

  • Use a two-pass approach: create nodes first, then attach children.
  • Keep the CSV schema explicit around ids and parent ids.
  • Return a list of roots so multiple-root datasets are handled naturally.
  • Validate missing parents and duplicate ids early.
  • Separate tree construction from rendering for easier testing and reuse.

Course illustration
Course illustration

All Rights Reserved.