Compressing data twice with gzip: why the second pass can show a high ratio
Introduction
If you take a normal byte stream, gzip it, and then gzip the resulting .gz bytes again, the second pass usually helps little or even makes the result larger. So when a second gzip pass appears to compress well, the interesting question is not “why is gzip magical?” but “what structure remained in the first output that the second pass could still exploit?”
Why a Second Pass Usually Does Not Help
Gzip uses the DEFLATE algorithm, which combines LZ77-style back-references with Huffman coding to remove repeated structure from the input. After a successful first compression, the output looks close to high-entropy data, which leaves little redundancy for a second pass to discover.
A simple Python example shows the typical behavior:
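The sketch below is one way to run that experiment with the standard-library gzip module. The sample input is a made-up, log-like byte stream (an assumption for illustration, not data from a real system); `first` and `second` hold the output of the first and second pass.

```python
import gzip
import random

# Hypothetical sample input: log-like lines with a fixed layout and varying values.
rng = random.Random(0)  # fixed seed so the run is repeatable
actions = ["login", "view", "edit", "logout"]
original = "".join(
    f"user={rng.randrange(10000)} action={rng.choice(actions)} ms={rng.randrange(5000)}\n"
    for _ in range(5000)
).encode("ascii")

first = gzip.compress(original)  # first pass over the raw bytes
second = gzip.compress(first)    # second pass over the .gz bytes

print("original:", len(original))
print("first:   ", len(first))
print("second:  ", len(second))
```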
On normal repetitive input, first becomes much smaller than the original, while second is often slightly larger than first because a new gzip wrapper is added and there is little useful redundancy left.
So Why Would the Second Pass Ever Look Good
There are a few realistic explanations.
The First Data Was Compressed in Independent Chunks
If the first stage compressed many small pieces separately, each piece started with a fresh dictionary. That is less efficient than compressing the whole stream once. When those individually compressed pieces are later combined, a second outer gzip pass may compress repeated headers, repeated block patterns, or repeated metadata.
The First Stage Included Wrapper Overhead
A gzip file is not just raw compressed payload. Each gzip member also carries a header (at least 10 bytes) and an 8-byte trailer holding a CRC-32 and the uncompressed length. If you concatenate many gzip members or many similarly structured compressed records, the second pass may compress those repeated wrappers and layout patterns.
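To make the wrapper visible, the following sketch inspects a single gzip member produced by Python's gzip module. The payload is a hypothetical tiny record, and `mtime=0` is passed only so the header bytes are the same on every run.

```python
import gzip
import struct
import zlib

payload = b"example record"               # hypothetical tiny payload
member = gzip.compress(payload, mtime=0)  # mtime=0 keeps the header bytes stable

magic = member[:2]   # 0x1f 0x8b identifies a gzip member
method = member[2]   # 8 means DEFLATE
# The last 8 bytes are the trailer: CRC-32 and length of the uncompressed data,
# both stored little-endian.
crc32, isize = struct.unpack("<II", member[-8:])

print("magic:", magic.hex(), "method:", method)
print("trailer matches payload:", crc32 == zlib.crc32(payload), isize == len(payload))
print("payload bytes:", len(payload), "member bytes:", len(member))
```

For a record this small the member is larger than the payload, and the leading header bytes are identical from one member to the next, which is exactly the kind of repetition an outer pass can pick up.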
The First Stage Was Not Truly the Same Input
Sometimes the comparison is misleading. For example, one pass may have used one compression level, chunk size, or file layout, while the second pass was run on a transformed artifact. In that case, you are not really observing “gzip beats gzip”; you are observing that the first pipeline left structure behind.
A Better Mental Model
Think about compression in terms of redundancy scope. A compressor is only effective over the data region it can see at once. If repeated content is split across many isolated chunks, the first stage may miss global patterns. A second stage that sees the whole assembled byte stream can still exploit cross-chunk repetition.
That does not mean double compression is a good storage strategy. It usually means the first stage was applied at the wrong granularity.
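The effect of granularity can be sketched by gzipping the same data at several chunk sizes and adding up the results. The data set and the helper name `chunked_size` are assumptions made for this illustration.

```python
import gzip
import random

# Hypothetical data set: many similar records, as in a log or an export file.
rng = random.Random(1)
data = b"".join(
    b"id=%d status=ok region=eu-west latency_ms=%d\n"
    % (rng.randrange(10**6), rng.randrange(2000))
    for _ in range(4000)
)

def chunked_size(data: bytes, chunk_size: int) -> int:
    """Total size when each chunk is gzipped on its own, with no shared history."""
    return sum(
        len(gzip.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )

# The last entry uses one chunk covering the whole data set.
sizes = {size: chunked_size(data, size) for size in (64, 1024, 16384, len(data))}
for size, total in sizes.items():
    print(f"chunk size {size:>7}: {total} bytes total")
```

With very small chunks the per-member wrapper and the lack of shared history dominate, so the total can even exceed the raw size; larger chunks give the compressor more context to work with.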
Demonstrating Chunk Effects
This example compresses small blocks separately, concatenates them, and then applies gzip again to the combined byte stream.
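A minimal sketch of that pipeline follows. The records are hypothetical JSON-like messages invented for the demonstration, and a single-stream baseline is included for comparison.

```python
import gzip

# Hypothetical records: small, similar JSON-like messages.
records = [
    b'{"event": "page_view", "user_id": %d, "path": "/items/%d"}'
    % (i * 7919 % 100000, i % 250)
    for i in range(2000)
]

# Stage 1: each record becomes its own gzip member.
members = [gzip.compress(r, mtime=0) for r in records]
concatenated = b"".join(members)

# Stage 2: an outer gzip pass over the concatenated members.
outer = gzip.compress(concatenated, mtime=0)

# Baseline: compress the raw records once, as a single stream.
whole = gzip.compress(b"".join(records), mtime=0)

print("raw records:      ", sum(len(r) for r in records))
print("chunked (stage 1):", len(concatenated))
print("chunked + outer:  ", len(outer))
print("single stream:    ", len(whole))
```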
You may see the outer pass save some space because it can compress repetition in the repeated member structure. But that is still usually worse than compressing the original uncompressed records as one stream in the first place.
What You Should Do Instead
If your goal is best compression ratio, the usual fix is one of these:
- compress the full dataset once instead of compressing small pieces independently
- use a stronger format such as xz or zstd if the workload allows it
- avoid wrapping already compressed data unless transport or protocol requirements demand it
Double gzip is rarely the right optimization target.
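As a rough illustration of the first two options, this sketch compresses a full data set once with gzip and once with xz, using Python's standard-library lzma module. The input is hypothetical, and zstd is left out because it may need a third-party package depending on the Python version.

```python
import gzip
import lzma
import random

# Hypothetical data set standing in for the full, uncompressed input.
rng = random.Random(2)
data = b"".join(
    b"sensor=%d reading=%d unit=celsius\n" % (rng.randrange(50), rng.randrange(400))
    for _ in range(5000)
)

gz = gzip.compress(data)  # one gzip pass over the full data set
xz = lzma.compress(data)  # one xz pass (lzma.compress defaults to the .xz container)

print("raw: ", len(data))
print("gzip:", len(gz))
print("xz:  ", len(xz))
```

Which format wins, and by how much, depends on the data and on the time budget, so it is worth running this kind of comparison on a representative sample.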
Common Pitfalls
The main pitfall is assuming a good second-pass ratio means repeated compression is generally useful. In most cases, it signals that the first-stage pipeline was structurally inefficient.
Another mistake is measuring only ratio and ignoring CPU cost. Even if the second pass saves a few percent, it may not justify the extra time and complexity.
A third issue is comparing unlike artifacts. If the first and second stages are not operating on equivalent representations, the observed gain does not prove much about gzip itself.
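For the CPU-cost pitfall, a rough measurement can look like the sketch below. The helper `timed_gzip` and the sample input are assumptions for illustration; timings vary by machine, so treat the numbers as relative only.

```python
import gzip
import random
import time

def timed_gzip(data):
    """Return (compressed_bytes, elapsed_seconds) for one gzip pass."""
    start = time.perf_counter()
    out = gzip.compress(data)
    return out, time.perf_counter() - start

# Hypothetical input; substitute a representative sample of the real data.
rng = random.Random(3)
data = b"".join(
    b"ts=%d level=INFO msg=request served in %d ms\n"
    % (1700000000 + i, rng.randrange(900))
    for i in range(20000)
)

first, t1 = timed_gzip(data)
second, t2 = timed_gzip(first)

saved = len(first) - len(second)  # negative if the second pass grows the data
print(f"pass 1: {len(data)} -> {len(first)} bytes in {t1:.4f} s")
print(f"pass 2: {len(first)} -> {len(second)} bytes in {t2:.4f} s, saved {saved} bytes")
```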
Summary
- A second gzip pass over already gzipped data usually helps little or hurts slightly.
- If it helps a lot, the first stage likely left repeated structure behind.
- Independent chunk compression is a common reason this happens.
- Repeated headers and member structure can be compressible even when the payload is not.
- The better fix is usually to compress once at the right granularity, not to adopt double gzip as a strategy.

