Data Processing
Non-supercomputer Techniques
Big Data
Data Management
Computational Methods

General techniques to work with huge amounts of data on a non-super computer

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

You can process large datasets on normal hardware if you design for memory limits, disk I O, and algorithmic complexity. Most failures come from trying to load everything at once or using tools in default modes that assume abundant resources. A practical workflow emphasizes streaming, chunking, selective columns, staged computation, and choosing data types that do not waste memory.

Start with Data Shape and Cost Estimates

Before writing code, estimate:

  • Row count and average row size.
  • Columns actually required.
  • Expected join sizes.
  • Sort and group by cardinality.

A rough estimate quickly shows whether full in memory processing is feasible. If data size approaches memory capacity, switch to chunk based or out of core processing early.

Prefer Columnar and Compressed Storage

Storage format has major performance impact. CSV is convenient but expensive for parsing and disk bandwidth. Columnar formats like Parquet or ORC often reduce read time and memory pressure because you can read only needed columns.

python
1import pandas as pd
2
3use_cols = ["user_id", "event_time", "value"]
4df = pd.read_parquet("events.parquet", columns=use_cols)
5print(df.head())

Pruning columns at read time prevents unnecessary memory growth.

Use Chunking and Streaming for Tabular Data

When files are too large, process fixed size chunks and aggregate results incrementally.

python
1import pandas as pd
2from collections import Counter
3
4counts = Counter()
5
6for chunk in pd.read_csv("events.csv", chunksize=200_000, usecols=["country"]):
7    counts.update(chunk["country"].dropna().tolist())
8
9print(counts.most_common(10))

This pattern is robust and works on commodity laptops for many analytics tasks.

Streaming is not only for files. The same idea applies to APIs, message queues, and database cursors. If the source can deliver records incrementally, design the consumer to aggregate incrementally too.

Reduce Early, Not Late

Push filters and projections as early as possible. Every unnecessary row and column carried forward multiplies downstream cost.

Good order of operations:

  1. Select required columns.
  2. Apply restrictive filters.
  3. Convert data types where helpful.
  4. Aggregate before expensive joins when possible.

This often gives bigger gains than low level micro optimization.

Use Indexes and Query Engines Wisely

If workload is query heavy, moving part of the pipeline to SQLite, DuckDB, or Postgres can outperform pure in memory scripts. Query planners and indexes are effective for repeated selective queries.

DuckDB example on local files:

python
1import duckdb
2
3con = duckdb.connect()
4result = con.execute("""
5    SELECT country, COUNT(*) AS n
6    FROM read_parquet('events.parquet')
7    WHERE event_time >= '2026-01-01'
8    GROUP BY country
9    ORDER BY n DESC
10    LIMIT 10
11""").fetchall()
12
13print(result)

This avoids loading full files into Python objects unnecessarily.

External Sort and Merge Strategies

Global sorting can exceed memory quickly. Use external sort patterns:

  1. Read manageable chunks.
  2. Sort each chunk.
  3. Write sorted chunks to disk.
  4. Merge streams lazily.

This is slower than in memory sort but scales beyond RAM limits. The same principle applies to deduplication and large joins.

Parallelism on Regular Machines

Parallelism helps only when bottlenecks match the approach:

  • CPU bound transformations can benefit from multiprocessing.
  • I O bound tasks may benefit from async or threaded reads.
  • Too many workers can increase memory pressure and reduce throughput.

Benchmark with realistic data before scaling worker counts.

Memory reduction can be as important as algorithm choice. Converting repetitive text columns to categorical values, narrowing numeric dtypes where safe, and dropping temporary columns early can turn a failing pipeline into a stable one.

Monitor Resource Usage Continuously

Measure memory, CPU, and I O while developing. Lightweight monitoring can reveal regressions early.

bash
/usr/bin/time -l python pipeline.py

Track peak memory and execution time per stage. Performance tuning without measurement usually wastes effort.

Common Pitfalls

  • Loading entire datasets when chunked processing is sufficient.
  • Using row oriented text formats for repeated analytical queries.
  • Delaying filters until late pipeline stages.
  • Running large joins without understanding key cardinality.
  • Increasing parallel workers without measuring memory and I O impact.

Summary

  • Large data on ordinary hardware is feasible with the right processing model.
  • Choose efficient storage formats and read only required columns.
  • Use chunked and streaming workflows to stay within memory limits.
  • Push filtering and aggregation early to reduce downstream work.
  • Measure resource usage and tune based on evidence, not assumptions.

Course illustration
Course illustration

All Rights Reserved.