C - How to implement Set data structure?

C programming

Set data structure

Data structures

Coding tutorial

Programming guide

C - How to implement Set data structure?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Implementing a set in C means designing a collection with unique elements and efficient insert, lookup, and delete operations. Because C has no built-in generic containers, you choose a concrete representation based on constraints: hash table for average O(1), balanced tree for ordered iteration, or bitset for dense bounded integer domains.

For most general-purpose integer sets, a hash table with separate chaining is a practical baseline. It is straightforward to implement and performs well with a good hash function and load-factor control.

Core Sections

1. Basic hash-set structure

1#include <stdlib.h>
2#include <stdbool.h>
3
4typedef struct Node {
5    int key;
6    struct Node *next;
7} Node;
8
9typedef struct {
10    Node **buckets;
11    size_t cap;
12    size_t size;
13} IntSet;

2. Hash and initialization

1static size_t hash_int(int x, size_t cap) {
2    return ((unsigned)x * 2654435761u) % cap;
3}
4
5IntSet *set_create(size_t cap) {
6    IntSet *s = malloc(sizeof(IntSet));
7    s->buckets = calloc(cap, sizeof(Node*));
8    s->cap = cap;
9    s->size = 0;
10    return s;
11}

3. Insert and contains

1bool set_contains(IntSet *s, int key) {
2    size_t idx = hash_int(key, s->cap);
3    for (Node *n = s->buckets[idx]; n; n = n->next) {
4        if (n->key == key) return true;
5    }
6    return false;
7}
8
9bool set_insert(IntSet *s, int key) {
10    if (set_contains(s, key)) return false;
11    size_t idx = hash_int(key, s->cap);
12    Node *n = malloc(sizeof(Node));
13    n->key = key;
14    n->next = s->buckets[idx];
15    s->buckets[idx] = n;
16    s->size++;
17    return true;
18}

4. Remove and cleanup

1bool set_remove(IntSet *s, int key) {
2    size_t idx = hash_int(key, s->cap);
3    Node **cur = &s->buckets[idx];
4    while (*cur) {
5        if ((*cur)->key == key) {
6            Node *tmp = *cur;
7            *cur = (*cur)->next;
8            free(tmp);
9            s->size--;
10            return true;
11        }
12        cur = &(*cur)->next;
13    }
14    return false;
15}
16
17void set_destroy(IntSet *s) {
18    for (size_t i = 0; i < s->cap; i++) {
19        Node *n = s->buckets[i];
20        while (n) {
21            Node *next = n->next;
22            free(n);
23            n = next;
24        }
25    }
26    free(s->buckets);
27    free(s);
28}

5. Resize for performance

When load factor (size/cap) grows too high (for example > 0.75), rehash into a larger table to preserve near O(1) average performance.

Common Pitfalls

Forgetting uniqueness checks and accidentally storing duplicates.
Using poor hash functions and causing heavy bucket collisions.
Not resizing as load factor grows, degrading performance to O(n).
Leaking memory by skipping node cleanup on remove or destroy.
Failing to handle edge cases like zero-capacity initialization safely.

Summary

A hash-based set in C is an efficient and practical data structure for unique element storage. Define clear ownership, implement insert/contains/remove correctly, and manage memory carefully. Add load-factor-based resizing to keep operations fast. With these foundations, you can extend to generic keys, custom hash/equality callbacks, or thread-safe variants as needed.

A practical way to keep this issue solved is to convert the guidance into a repeatable runbook that can be executed by anyone on the team. Write down the exact environment assumptions, dependency versions, runtime flags, and validation commands required to confirm the behavior. Include expected outputs for the happy path and one or two known failure signatures so the next engineer can quickly classify what they are seeing. This turns fragile tribal knowledge into an operational artifact that survives handoffs, on-call rotations, and context switches.

It is also useful to add one lightweight automated guardrail in CI so regressions are caught before deployment. The guardrail should target the most failure-prone step in the workflow: an import smoke test, configuration lint, compatibility check, integration probe, or small benchmark assertion. Keep that check fast enough to run on every change and explicit enough that failure messages are actionable. In teams with parallel contributors, early automated detection prevents repeated debugging of the same class of issue.

Finally, keep examples current as tools and frameworks evolve. A command or API that worked six months ago may become deprecated, renamed, or behaviorally different. Treat documentation updates as normal maintenance work, just like test upkeep. When guidance is version-aware and tested regularly, you avoid drift between article recommendations and production reality, and the content remains useful for both new and experienced engineers.