General techniques to work with huge amounts of data on a non-super computer
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
You can process large datasets on normal hardware if you design for memory limits, disk I O, and algorithmic complexity. Most failures come from trying to load everything at once or using tools in default modes that assume abundant resources. A practical workflow emphasizes streaming, chunking, selective columns, staged computation, and choosing data types that do not waste memory.
Start with Data Shape and Cost Estimates
Before writing code, estimate:
- Row count and average row size.
- Columns actually required.
- Expected join sizes.
- Sort and group by cardinality.
A rough estimate quickly shows whether full in memory processing is feasible. If data size approaches memory capacity, switch to chunk based or out of core processing early.
Prefer Columnar and Compressed Storage
Storage format has major performance impact. CSV is convenient but expensive for parsing and disk bandwidth. Columnar formats like Parquet or ORC often reduce read time and memory pressure because you can read only needed columns.
Pruning columns at read time prevents unnecessary memory growth.
Use Chunking and Streaming for Tabular Data
When files are too large, process fixed size chunks and aggregate results incrementally.
This pattern is robust and works on commodity laptops for many analytics tasks.
Streaming is not only for files. The same idea applies to APIs, message queues, and database cursors. If the source can deliver records incrementally, design the consumer to aggregate incrementally too.
Reduce Early, Not Late
Push filters and projections as early as possible. Every unnecessary row and column carried forward multiplies downstream cost.
Good order of operations:
- Select required columns.
- Apply restrictive filters.
- Convert data types where helpful.
- Aggregate before expensive joins when possible.
This often gives bigger gains than low level micro optimization.
Use Indexes and Query Engines Wisely
If workload is query heavy, moving part of the pipeline to SQLite, DuckDB, or Postgres can outperform pure in memory scripts. Query planners and indexes are effective for repeated selective queries.
DuckDB example on local files:
This avoids loading full files into Python objects unnecessarily.
External Sort and Merge Strategies
Global sorting can exceed memory quickly. Use external sort patterns:
- Read manageable chunks.
- Sort each chunk.
- Write sorted chunks to disk.
- Merge streams lazily.
This is slower than in memory sort but scales beyond RAM limits. The same principle applies to deduplication and large joins.
Parallelism on Regular Machines
Parallelism helps only when bottlenecks match the approach:
- CPU bound transformations can benefit from multiprocessing.
- I O bound tasks may benefit from async or threaded reads.
- Too many workers can increase memory pressure and reduce throughput.
Benchmark with realistic data before scaling worker counts.
Memory reduction can be as important as algorithm choice. Converting repetitive text columns to categorical values, narrowing numeric dtypes where safe, and dropping temporary columns early can turn a failing pipeline into a stable one.
Monitor Resource Usage Continuously
Measure memory, CPU, and I O while developing. Lightweight monitoring can reveal regressions early.
Track peak memory and execution time per stage. Performance tuning without measurement usually wastes effort.
Common Pitfalls
- Loading entire datasets when chunked processing is sufficient.
- Using row oriented text formats for repeated analytical queries.
- Delaying filters until late pipeline stages.
- Running large joins without understanding key cardinality.
- Increasing parallel workers without measuring memory and I O impact.
Summary
- Large data on ordinary hardware is feasible with the right processing model.
- Choose efficient storage formats and read only required columns.
- Use chunked and streaming workflows to stay within memory limits.
- Push filtering and aggregation early to reduce downstream work.
- Measure resource usage and tune based on evidence, not assumptions.

