Data Compression
Algorithms
Coding Theory
Lossless Compression
Data Science

Data Compression Algorithms

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Data compression algorithms reduce the number of bits needed to store or transmit information. The right algorithm depends on what kind of data you have, whether exact recovery is required, and whether you care more about compression ratio, speed, or simplicity.

Lossless vs. Lossy Compression

Compression techniques fall into two broad categories.

Lossless compression preserves the original data exactly after decompression. It is used for text, source code, executables, logs, and many structured data formats.

Lossy compression discards some information in exchange for much smaller outputs. It is common for images, audio, and video where perfect reconstruction is not always necessary.

A small Python example of lossless compression with zlib shows the basic idea:

python
1import zlib
2
3data = b"AAAAABBBBBCCCCCDDDDDEEEEE" * 100
4compressed = zlib.compress(data, level=9)
5restored = zlib.decompress(compressed)
6
7print("original:", len(data))
8print("compressed:", len(compressed))
9print("restored matches:", restored == data)

Because the input contains repeated patterns, the compressor can represent it more compactly.

Common Lossless Techniques

Many practical lossless algorithms combine a few core ideas:

  • Run-length encoding for repeated symbols
  • Dictionary methods such as LZ77 or LZW for repeated substrings
  • Entropy coding such as Huffman coding for frequent symbols

Formats such as ZIP, gzip, and PNG use combinations of these concepts rather than a single isolated trick.

Here is a quick gzip example using Python's standard library:

python
1import gzip
2
3text = ("error: invalid token\n" * 1000).encode("utf-8")
4
5with gzip.open("sample.txt.gz", "wb") as f:
6    f.write(text)
7
8with gzip.open("sample.txt.gz", "rb") as f:
9    restored = f.read()
10
11print(restored == text)

The data returns exactly, which is why gzip is suitable for logs and documents.

Common Lossy Techniques

Lossy compression works differently. It tries to remove information that is less important to human perception or that can be approximated acceptably.

Examples include:

  • JPEG for photographic images
  • MP3 or AAC for audio
  • H.264 and H.265 for video

These formats often transform the signal into another domain, quantize the result, and then entropy-code what remains. The outcome is smaller, but not mathematically identical to the original.

That tradeoff is acceptable for many media workflows but unacceptable for source code or financial records.

Choosing an Algorithm

A useful decision process is:

  • Use lossless compression when exact reconstruction matters.
  • Use lossy compression when size matters more than perfect recovery.
  • Measure both ratio and speed, because smaller files are not always worth slow encoding or decoding.

For example:

  • Text files often compress extremely well with gzip or zstd.
  • Small random data may barely compress at all.
  • Already-compressed JPEG images usually do not benefit much from another general-purpose compressor.

Compression is not just an algorithm problem. It is a data-shape problem.

Compression Ratio vs. Throughput

Developers often focus only on compression ratio, but speed matters too. A backup system, API, or streaming pipeline may prefer slightly larger output if it can compress and decompress much faster.

That is why real systems benchmark algorithms on their own workloads. A compression method that looks excellent on one dataset may be disappointing on another.

Common Pitfalls

The biggest mistake is using lossy compression on data that must round-trip exactly. Once discarded, the lost information cannot be recovered.

Another issue is expecting all data to compress well. High-entropy or already-compressed data often resists further compression, and repeated recompression can even increase size slightly because of format overhead.

Developers also sometimes compare algorithms without considering decode speed. A strong compression ratio is useful, but slow decompression can be painful for user-facing systems.

Finally, do not assume a compression algorithm is universally best. The right answer depends on data type, latency budget, storage cost, and compatibility with the surrounding system.

Summary

  • Data compression reduces storage or transmission cost by representing data more compactly.
  • Lossless compression preserves exact data; lossy compression trades accuracy for size.
  • Practical algorithms combine ideas such as dictionary coding and entropy coding.
  • Choose algorithms based on your data type and operational constraints, not on reputation alone.
  • Measure both compression ratio and speed on realistic workloads.

Course illustration
Course illustration

All Rights Reserved.