Compressing data twice with gzip with a high ratio the second time

Introduction

If you take a normal byte stream, gzip it, and then gzip the resulting .gz bytes again, the second pass usually helps little or even makes the result larger. So when a second gzip pass appears to compress well, the interesting question is not “why is gzip magical?” but “what structure remained in the first output that the second pass could still exploit?”

Why a Second Pass Usually Does Not Help

Gzip uses the DEFLATE algorithm, which combines LZ77-style back-references with Huffman coding to remove repeated structure from the input. After a successful first pass, the output looks close to high-entropy data, and high-entropy data has little redundancy left for a second pass to discover.
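To make "high-entropy" concrete, the following minimal sketch measures the byte-level Shannon entropy before and after each pass. The helper byte_entropy and the word-list payload are illustrative assumptions, not part of any library, and byte-level entropy is only a rough order-0 measure that ignores longer-range structure.

```python
import gzip
import math
import random
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical sample input: pseudo-random words from a small vocabulary.
rng = random.Random(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta", b"eta", b"theta"]
payload = b" ".join(rng.choice(words) for _ in range(20000))

first = gzip.compress(payload)   # pass over the raw text
second = gzip.compress(first)    # pass over the .gz bytes

print(f"payload: {len(payload)} bytes, {byte_entropy(payload):.2f} bits/byte")
print(f"first:   {len(first)} bytes, {byte_entropy(first):.2f} bits/byte")
print(f"second:  {len(second)} bytes, {byte_entropy(second):.2f} bits/byte")
```

The first output should sit close to the 8 bits/byte ceiling, which is why the second pass has so little left to work with on input like this.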

A simple Python example shows the typical behavior:

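```python
import gzip

# Highly repetitive payload: 50,000 bytes built from a 5-byte pattern.
payload = b"abcde" * 10000
first = gzip.compress(payload)   # first pass over the raw bytes
second = gzip.compress(first)    # second pass over the .gz bytes

print("original:", len(payload))
print("first gzip:", len(first))
print("second gzip:", len(second))
```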

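On typical input, first is much smaller than the original, while second is usually slightly larger than first because a new gzip wrapper is added and there is little useful redundancy left. Extremely repetitive input such as this payload is an edge case: DEFLATE limits each match to 258 bytes, so a long repeated run is encoded as a sequence of near-identical match codes, and the second pass may still shrink that sequence. Run the snippet and compare the numbers rather than assuming the result.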

So Why Would the Second Pass Ever Look Good?

There are a few realistic explanations.

The First Data Was Compressed in Independent Chunks

If the first stage compressed many small pieces separately, each piece started with a fresh dictionary. That is less efficient than compressing the whole stream once. When those individually compressed pieces are later combined, a second outer gzip pass may compress repeated headers, repeated block patterns, or repeated metadata.

The First Stage Included Wrapper Overhead

A gzip file is not just raw compressed payload. Each gzip member carries a header of at least 10 bytes and an 8-byte trailer that holds a CRC-32 and the uncompressed length. If you concatenate many gzip members or many similarly structured compressed records, the second pass may compress those repeated wrappers and layout patterns.
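The wrapper can be inspected directly. This is a minimal sketch that assumes a member produced by Python's gzip.compress with no optional header fields, so the header is exactly 10 bytes; the mtime argument (available since Python 3.8) is set to 0 only to keep the bytes reproducible.

```python
import gzip
import struct
import zlib

data = b"sensor=17,value=42\n"
member = gzip.compress(data, mtime=0)  # mtime=0 keeps the header reproducible

# Assumes no optional header fields, so the header is the fixed 10 bytes.
header, trailer = member[:10], member[-8:]
magic, method, flags = header[:2], header[2], header[3]
crc32, isize = struct.unpack("<II", trailer)  # both fields are little-endian

print("magic:", magic.hex(), "method:", method, "flags:", flags)
print("crc32 matches:", crc32 == zlib.crc32(data))
print("isize:", isize, "of", len(data), "input bytes")
print("wrapper bytes:", len(header) + len(trailer), "of", len(member))
```

For a record this small, the 18 wrapper bytes are a large share of the member, and the header bytes are the same in every member built this way.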

The First Stage Was Not Truly the Same Input

Sometimes the comparison is misleading. For example, one pass may have used one compression level, chunk size, or file layout, while the second pass was run on a transformed artifact. In that case, you are not really observing “gzip beats gzip”; you are observing that the first pipeline left structure behind.

A Better Mental Model

Think about compression in terms of redundancy scope. A compressor is only effective over the data region it can see at once; for DEFLATE that is a sliding window of at most 32 KiB within a single stream. If repeated content is split across many isolated chunks, the first stage may miss global patterns. A second stage that sees the whole assembled byte stream can still exploit cross-chunk repetition.

That does not mean double compression is a good storage strategy. It usually means the first stage was applied at the wrong granularity.
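Redundancy scope can be shown with a small experiment. DEFLATE back-references reach at most 32 KiB, so a repeated block is only found if its earlier copy is still inside that window. The sketch below uses seeded pseudo-random bytes as incompressible filler; the helper random_block and the block sizes are illustrative assumptions.

```python
import gzip
import random

rng = random.Random(0)


def random_block(size: int) -> bytes:
    """Incompressible filler: pseudo-random bytes from a seeded generator."""
    return bytes(rng.getrandbits(8) for _ in range(size))


near = random_block(20_000)  # second copy starts 20,000 bytes back: inside the window
far = random_block(40_000)   # second copy starts 40,000 bytes back: outside the window

# Each input is one random block followed by an exact copy of itself.
near_ratio = len(gzip.compress(near * 2)) / len(near * 2)
far_ratio = len(gzip.compress(far * 2)) / len(far * 2)

print(f"repeat distance 20,000: output is {near_ratio:.0%} of input")
print(f"repeat distance 40,000: output is {far_ratio:.0%} of input")
```

Both inputs are exactly 50% redundant, but gzip can only remove the redundancy it can reach: the near case shrinks to roughly half, while the far case stays about the same size as the input.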

Demonstrating Chunk Effects

This example compresses small blocks separately, concatenates them, and then applies gzip again to the combined byte stream.

```python
import gzip

# 200 identical records, each compressed on its own as a separate gzip member.
records = [b"sensor=17,value=42\n" * 200 for _ in range(200)]
compressed_records = [gzip.compress(r) for r in records]

# Concatenate the members, then gzip the combined byte stream again.
combined = b"".join(compressed_records)
outer = gzip.compress(combined)

print("combined gzip members:", len(combined))
print("outer gzip over combined members:", len(outer))
```

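Because every record here is identical, the compressed members are near-identical too, and the outer pass removes most of the repetition among them. With more varied records, the gain shrinks to the repeated member structure (headers, trailer fields, and similar block prefixes), and the result is usually worse than compressing the original uncompressed records as one stream in the first place.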
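To compare the outer pass with single-stream compression on less uniform data, the experiment can be extended as in the following sketch. The record format with pseudo-random fields is a made-up example, and mtime=0 (Python 3.8+) is used only so that every run produces the same bytes.

```python
import gzip
import random

rng = random.Random(0)
# Hypothetical varied records: one short line each, with pseudo-random fields.
records = [
    b"sensor=%d,value=%d\n" % (rng.randint(10, 99), rng.randint(1000, 9999))
    for _ in range(500)
]

# mtime=0 fixes the header timestamp so every run produces the same bytes.
combined = b"".join(gzip.compress(r, mtime=0) for r in records)
outer = gzip.compress(combined, mtime=0)             # second pass over the members
single = gzip.compress(b"".join(records), mtime=0)   # one pass over the raw records

print("raw records:          ", sum(len(r) for r in records))
print("combined gzip members:", len(combined))
print("outer gzip over them: ", len(outer))
print("single gzip stream:   ", len(single))
```

With records like these, the outer pass recovers part of the per-member overhead, but the single stream comes out smaller because it can reuse the shared field names across all records.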

What You Should Do Instead

If your goal is best compression ratio, the usual fix is one of these:

- compress the full dataset once instead of compressing small pieces independently
- use a stronger format such as xz or zstd if the workload allows it
- avoid wrapping already compressed data unless transport or protocol requirements demand it
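The last point can be enforced with a small guard in the pipeline. The helper gzip_once below is a hypothetical sketch, not a library function: it checks the two gzip magic bytes and the DEFLATE method byte, which is only a heuristic because arbitrary data could begin with the same three bytes.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def gzip_once(data: bytes) -> bytes:
    """Compress with gzip unless the bytes already look like a gzip stream."""
    # Heuristic only: magic bytes plus the DEFLATE method byte (8).
    if data[:2] == GZIP_MAGIC and data[2:3] == b"\x08":
        return data
    return gzip.compress(data)


raw = b"sensor=17,value=42\n" * 1000
once = gzip_once(raw)
twice = gzip_once(once)  # returned unchanged, no second wrapper is added

print("raw:", len(raw), "once:", len(once), "twice:", len(twice))
```

A guard like this keeps accidental double wrapping out of a pipeline without changing how already compressed inputs are stored.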

Double gzip is rarely the right optimization target.

Common Pitfalls

The main pitfall is assuming a good second-pass ratio means repeated compression is generally useful. In most cases, it signals that the first-stage pipeline was structurally inefficient.

Another mistake is measuring only ratio and ignoring CPU cost. Even if the second pass saves a few percent, it may not justify the extra time and complexity.
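One way to keep CPU cost in view is to record time alongside the bytes saved. This is a rough sketch: timed_gzip is a hypothetical helper, the records are made up, and a single perf_counter reading is noisy, so a real measurement should repeat the run (for example with the timeit module).

```python
import gzip
import time


def timed_gzip(data: bytes):
    """Return the compressed bytes and the wall-clock seconds spent."""
    start = time.perf_counter()
    out = gzip.compress(data)
    return out, time.perf_counter() - start


# Hypothetical input: many small records, each already gzipped on its own.
records = [
    b"sensor=%d,value=%d\n" % (i % 90 + 10, i * 7 % 9000 + 1000)
    for i in range(2000)
]
combined = b"".join(gzip.compress(r) for r in records)

outer, outer_seconds = timed_gzip(combined)
saved = len(combined) - len(outer)
print(f"outer pass saved {saved} bytes ({saved / len(combined):.1%}) "
      f"in {outer_seconds * 1000:.2f} ms")
```

Reporting both numbers together makes it easier to judge whether the saving is worth the extra pass on the target hardware.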

A third issue is comparing unlike artifacts. If the first and second stages are not operating on equivalent representations, the observed gain does not prove much about gzip itself.

Summary

- A second gzip pass over already gzipped data usually helps little or hurts slightly.
- If it helps a lot, the first stage likely left repeated structure behind.
- Independent chunk compression is a common reason this happens.
- Repeated headers and member structure can be compressible even when the payload is not.
- The better fix is usually to compress once at the right granularity, not to adopt double gzip as a strategy.