Networking & Load Balancing System Design

REST vs gRPC Streaming: When Persistent Connections Help and When They Hurt

April 1, 2026

REST and gRPC are both ways to call a function on another machine, but the wire model is different in ways that matter under load.

REST is request and response over HTTP/1.1 or HTTP/2. Each call is its own transaction. The body is usually JSON, the schema lives in your head or in OpenAPI, and the connection can be reused for the next call but does not have to be.

gRPC runs on HTTP/2, uses protobuf for the body, and ships with code-generated stubs in every major language. It supports four call modes:

Unary: one request, one response. The REST equivalent.
Server streaming: one request, many responses.
Client streaming: many requests, one response.
Bidirectional streaming: many requests, many responses, both directions concurrent.
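The four modes differ only in whether each side sends one message or a sequence. A minimal sketch of those shapes in plain Python, with no gRPC library involved: the function names and the string message type are hypothetical stand-ins, and the bidirectional version is simplified to lock-step, whereas a real bidi stream lets both sides send independently.

```python
from typing import Iterable, Iterator

# Hypothetical message type: plain strings stand in for protobuf messages.
Msg = str


def unary(request: Msg) -> Msg:
    """One request in, one response out (the REST-like shape)."""
    return f"echo:{request}"


def server_streaming(request: Msg) -> Iterator[Msg]:
    """One request in, a sequence of responses out."""
    for i in range(3):
        yield f"{request}#{i}"


def client_streaming(requests: Iterable[Msg]) -> Msg:
    """A sequence of requests in, one response out."""
    count = sum(1 for _ in requests)
    return f"received {count} messages"


def bidi_streaming(requests: Iterable[Msg]) -> Iterator[Msg]:
    """A sequence in, a sequence out.

    Simplification: this replies once per request, in order. A real
    bidirectional stream does not require that pairing.
    """
    for request in requests:
        yield f"ack:{request}"


progress = list(server_streaming("job"))
summary = client_streaming(iter(["a", "b"]))
acks = list(bidi_streaming(["x", "y"]))
```

In generated gRPC stubs the streaming sides usually surface as iterators or async streams in much the same way, which is why the calling code for the three streaming modes looks so different from a unary call.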

Streaming sounds like a free upgrade, but the cost is hidden. For a short, single-message response, the per-request overhead of REST and unary gRPC is similar. Header compression and HTTP/2 multiplexing give gRPC a small edge, but protobuf encoding and the generated-stub layer eat most of it.

Where streaming actually wins is multi-message workloads on a persistent connection. Three good fits:

Long-running computation that emits progress updates.
Telemetry feeds where the server pushes new data as it arrives.
Chat or collaborative editing where both sides need to send asynchronously.

Now the production failure. A team built a chat backend on gRPC bidirectional streaming and put it behind an L7 load balancer that balanced per request. A bidi stream is a single HTTP/2 stream, though, which is a single request from the LB's perspective. The LB made one routing decision when the stream opened, so each stream stayed pinned to one backend pod for its entire lifetime.
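The pinning can be illustrated with a toy model, not the team's actual setup. It assumes a round-robin LB, and the pod names, user names, and message counts are invented for illustration. The point is that the LB gets exactly one routing decision per stream, so every message on that stream lands on the same backend.

```python
from collections import Counter
from itertools import cycle


def route_per_request(backends, streams):
    """Simulate an L7 LB that balances per HTTP/2 request.

    `streams` maps a stream id to the number of messages it will carry.
    The LB decides once, when the stream opens; nothing is rebalanced
    afterwards.
    """
    picker = cycle(backends)      # round-robin over backends (assumed policy)
    assignment = {}               # stream id -> backend
    messages = Counter()          # backend -> messages handled
    for stream_id, message_count in streams.items():
        backend = next(picker)    # the only routing decision for this stream
        assignment[stream_id] = backend
        messages[backend] += message_count
    return assignment, messages


# Hypothetical chat sessions: two busy, two nearly idle.
streams = {"alice": 500, "bob": 3, "carol": 400, "dave": 2}
assignment, messages = route_per_request(["pod-a", "pod-b"], streams)
```

Each pod receives two streams, so the LB's own view looks balanced, yet in this example pod-a handles 900 messages and pod-b handles 5. Request-level balancing only evens out work when requests are short.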

That worked fine until the first deploy. The LB drained the targeted pod by no longer routing new connections to it. The existing streams stayed open and kept routing to the draining pod. When the pod's terminationGracePeriodSeconds ran out, the kubelet force-killed the containers with SIGKILL, and every chat session on that pod dropped at the same instant. Users saw "reconnecting" toasts in waves of thousands.
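The timeline of that failure is simple enough to model. This sketch assumes the LB stops sending new streams to the pod when the drain starts, that nothing asks existing streams to move, and that the pod is force-killed when the grace period ends. The stream ids and timings are made up.

```python
def hard_drain_timeline(stream_open_times, drain_start_s, grace_period_s):
    """Return the second at which each stream on the draining pod ends.

    `stream_open_times` maps a stream id to the second it was opened.
    """
    kill_time_s = drain_start_s + grace_period_s
    end_times = {}
    for stream_id, opened_at_s in stream_open_times.items():
        if opened_at_s >= drain_start_s:
            continue  # opened after the drain began, so routed to another pod
        end_times[stream_id] = kill_time_s  # stays open until the kill
    return end_times


# Hypothetical streams: three opened before the drain, one after it.
streams = {"s1": 0, "s2": 40, "s3": 95, "s4": 120}
ends = hard_drain_timeline(streams, drain_start_s=100, grace_period_s=30)
```

However long each stream has been open, they all end in the same second. The grace period delays the drop but does nothing to spread it out.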

The fix has two halves. First, use client-side load balancing so the gRPC client maintains connections to every backend and round-robins new streams across them. Second, run a streaming-aware proxy that participates in graceful drain by sending GOAWAY frames with enough lead time for clients to reconnect to a fresh pod.
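Both halves can be sketched as a small simulation rather than real gRPC code. It rests on several assumptions: clients pick backends round-robin, each client reacts to a GOAWAY by reopening its stream at a random moment within the lead time (the jitter is this sketch's choice, not something the protocol provides), and all names and numbers are hypothetical. In HTTP/2, GOAWAY itself only stops new streams on the connection, so a long-lived stream still has to be closed and reopened by the client or the server.

```python
import random
from collections import Counter
from itertools import cycle


def assign_streams_round_robin(backends, stream_ids):
    """Client-side balancing: pick a backend per new stream."""
    picker = cycle(backends)
    return {stream_id: next(picker) for stream_id in stream_ids}


def graceful_drain_schedule(stream_ids, goaway_at_s, lead_time_s, rng):
    """Time at which each stream moves after a GOAWAY at goaway_at_s.

    Each client waits a random delay within the lead time (assumed jitter).
    """
    return {
        stream_id: goaway_at_s + rng.uniform(0, lead_time_s)
        for stream_id in stream_ids
    }


def peak_moves_per_second(schedule):
    """Largest number of streams that move within one one-second bucket."""
    buckets = Counter(int(moved_at_s) for moved_at_s in schedule.values())
    return max(buckets.values())


stream_ids = [f"s{i}" for i in range(3000)]
placement = assign_streams_round_robin(["pod-a", "pod-b", "pod-c"], stream_ids)
on_pod_a = [s for s, pod in placement.items() if pod == "pod-a"]

schedule = graceful_drain_schedule(
    on_pod_a, goaway_at_s=100, lead_time_s=20, rng=random.Random(7)
)
peak = peak_moves_per_second(schedule)
```

Round-robin caps the blast radius at a third of the streams in this example, and the jittered lead time turns one simultaneous drop into reconnects spread over twenty seconds. The lead time has to fit inside the pod's termination grace period, or the forced kill still wins the race.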

Three questions to ask before choosing streaming:

Is your response actually multi-message, or are you reaching for streaming for taste reasons?
Does your LB do connection-level or request-level balancing?
How will a deploy drain the connections without dropping every active stream at once?

Key takeaway

gRPC streaming wins for long-running, multi-message workloads. It loses for short responses, and it breaks badly when a request-level L7 LB pins each stream to a single backend that later drains.

Originally posted on LinkedIn.