string manipulation
tree representation
data structures
parsing algorithms
programming techniques

Convert string to tree representation with rules

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Turning a string into a tree is a parsing problem: you need a grammar, token boundaries, and rules that say when a node starts, when children begin, and when that subtree ends. Once those rules are explicit, the implementation becomes straightforward with either recursion or a stack.

Define the grammar before writing code

Suppose the input format is:

text
A(B(C,D),E(F,G(H,I)))

One reasonable rule set is:

  • a node name is a run of letters or digits
  • '( starts the child list for the most recent node'
  • ', separates siblings'
  • ') ends the current child list'

With those rules, the tree means:

  • 'A is the root'
  • 'B and E are children of A'
  • 'C and D are children of B'
  • 'F and G are children of E'

If the grammar is ambiguous, no tree builder can save you. The first job is always to make the representation deterministic.

Build the tree with a stack

A stack-based parser works well for parenthesized formats because the stack naturally tracks the current parent chain.

python
1class Node:
2    def __init__(self, value):
3        self.value = value
4        self.children = []
5
6    def __repr__(self):
7        return f"Node({self.value!r}, {self.children!r})"
8
9
10def parse_tree(text: str) -> Node:
11    stack = []
12    current = None
13    root = None
14    i = 0
15
16    while i < len(text):
17        ch = text[i]
18
19        if ch.isalnum():
20            start = i
21            while i < len(text) and text[i].isalnum():
22                i += 1
23            name = text[start:i]
24            node = Node(name)
25
26            if current is not None:
27                current.children.append(node)
28            else:
29                root = node
30
31            current = node
32            continue
33
34        if ch == '(':
35            stack.append(current)
36        elif ch == ',':
37            current = stack[-1]
38        elif ch == ')':
39            current = stack.pop()
40
41        i += 1
42
43    if root is None:
44        raise ValueError("Input does not contain a root node")
45
46    return root
47
48
49tree = parse_tree("A(B(C,D),E(F,G(H,I)))")
50print(tree)

This parser succeeds because every symbol has one clear job. It does not guess what commas or parentheses mean after the fact.

Recursive parsing is often cleaner for formal grammars

If the string format is more language-like, recursive descent can be easier to maintain. For example, a grammar might be expressed conceptually as:

  • 'Node := Name [ '(' Children ')' ]'
  • 'Children := Node ( ',' Node )*'

That style maps well to functions such as parse_node() and parse_children(). The stack approach is still valid, but recursion tends to mirror the grammar more directly when the language becomes more complex.

Add validation, not just construction

A robust parser should reject malformed input instead of building a misleading partial tree. Important validations include:

  • unmatched parentheses
  • empty node names
  • repeated separators such as A(,B)
  • trailing junk after the root

For example, after parsing a full root, you should verify that the whole string was consumed. Otherwise, an input like A(B)XYZ might silently produce the wrong result.

It also helps to add a serializer so you can round-trip the structure during tests:

python
1def to_dict(node: Node):
2    return {
3        "value": node.value,
4        "children": [to_dict(child) for child in node.children],
5    }
6
7
8print(to_dict(tree))

That makes it easy to assert the parsed structure in unit tests.

Common Pitfalls

The most common mistake is trying to parse the string without first defining the format precisely. If commas, spaces, or parentheses are allowed in multiple roles, the tree shape becomes ambiguous.

Another mistake is storing only one "current node" without a parent stack. That works for flat input but breaks as soon as nesting is more than one level deep.

Developers also forget error handling. A parser that silently accepts malformed input is worse than one that throws a clear exception, because downstream code may trust a corrupted tree.

Finally, do not mix tokenization and semantics carelessly. Once names can contain more than one character, character-by-character logic without a tokenizer often becomes fragile.

Summary

  • Converting a string to a tree is a parsing problem, so define the grammar first.
  • Parenthesized formats are easy to implement with a stack or recursive descent.
  • Each symbol in the input should have one unambiguous structural meaning.
  • Add validation for malformed input, not just happy-path construction.
  • Test the parser with a serializer or structural assertions so tree shape stays correct.

Course illustration
Course illustration

All Rights Reserved.