gRPC Internals

gRPC is the RPC framework that finally made HTTP/2 useful for backend services. It pairs Protocol Buffers, a compact, schema-driven binary wire format, with HTTP/2's multiplexed streams, then layers on deadlines, cancellation, retries, name resolution, and load balancing. The whole thing is designed for what REST never solved well: high-throughput, low-latency, strongly-typed service-to-service communication, with first-class bidirectional streaming and language-agnostic code generation. It is Google's open-source successor to its internal Stubby framework and now powers internal RPC at Netflix, Square, Dropbox, Lyft, Cisco, and much of the Kubernetes ecosystem (etcd, containerd, and the CRI and CSI interfaces all speak gRPC).

The reference implementation is gRPC Core (C-core), which the C++, Python, Ruby, PHP, and Objective-C libraries wrap (as did the original C# library, since superseded by grpc-dotnet), with separate native implementations in Go (grpc-go) and Java (grpc-java).

gRPC Architecture Overview

Key Numbers

Wire encoding	Protobuf (binary)
Transport	HTTP/2 only
Streaming modes	4
Message frame prefix	5 bytes
Max msg (default)	4 MiB
Status code domain	17 codes
Default keepalive ping	2 hours

Why gRPC Exists

REST Was Bloated

A typical JSON REST call carries several times more bytes on the wire than the same data in protobuf, because every field name appears as a string in every payload and numbers are spelled out as text. Backend-to-backend traffic at Google scale was wasting compute on JSON parsing and bandwidth on field names.

HTTP/1.1 Couldn't Multiplex

A single HTTP/1.1 connection serves one request at a time. Every additional concurrent call needed a new connection — and head-of-line blocking on the connection meant a slow response stalled everything behind it. HTTP/2's stream multiplexing was the unlock.

Schemas Force Discipline

A .proto file is the single source of truth for both client and server in any language. Field numbers, not names, identify fields on the wire, so you can rename a field without changing the binary encoding, deprecate without breaking, and add fields on either side without coordinating a rollout. In statically typed languages the generated code turns a type mismatch into a compile error rather than a runtime surprise.

Protocol Buffers: The Wire Format

A protobuf message on the wire is a sequence of (tag, value) pairs. The tag encodes the field number plus a 3-bit wire type:

tag = (field_number << 3) | wire_type
wire types: 0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit
// (3 and 4 were start/end group; deprecated in proto3)

A varint is the workhorse encoding: 7 bits of payload per byte, least-significant group first, with the MSB set as a continuation marker. Values 0-127 take one byte; 128-16383 take two. The number 300 (binary 100101100) becomes AC 02 on the wire. A negative int32 is sign-extended to 64 bits before encoding, costing 10 bytes, which is why protobuf has a separate sint32 with zigzag encoding (n becomes 2n if non-negative, -2n-1 if negative) so small negative numbers stay small.
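The tag formula and the varint rules above can be checked with a few lines of Python. This is a minimal sketch rather than the protobuf library's code; the function names are made up for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as a varint: 7 bits per byte, low group first."""
    if value < 0:
        raise ValueError("varints carry unsigned values; sign-extend or zigzag first")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # MSB set: more bytes follow
        else:
            out.append(byte)         # MSB clear: last byte
            return bytes(out)


def encode_int32(value: int) -> bytes:
    """int32 encoding: a negative value is sign-extended to 64 bits first."""
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


def zigzag32(value: int) -> int:
    """sint32 mapping: 0, -1, 1, -2, 2 ... become 0, 1, 2, 3, 4 ..."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF


def make_tag(field_number: int, wire_type: int) -> int:
    """tag = (field_number << 3) | wire_type"""
    return (field_number << 3) | wire_type


print(encode_varint(300).hex())              # ac02
print(len(encode_int32(-1)))                 # 10
print(encode_varint(zigzag32(-1)).hex())     # 01
print(hex(make_tag(3, 2)))                   # 0x1a
```

The last two lines show the payoff of sint32: -1 costs ten bytes as an int32 but one byte after zigzag.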

A length-delimited field (wire type 2) is a varint length followed by that many bytes of payload — used for string, bytes, embedded messages, and packed repeated scalars. Embedded messages aren't framed any differently from byte strings; the parser knows the type from the schema.

// proto3 schema
message User {
  int64 id = 1;
  string name = 2;
  repeated string emails = 3;
}

// User{id=300, name="Ada", emails=["a@x", "b@y"]} on the wire:
// 08 AC 02            field 1 (varint), value 300
// 12 03 41 64 61      field 2 (length=3), bytes "Ada"
// 1A 03 61 40 78      field 3 (length=3), bytes "a@x"
// 1A 03 62 40 79      field 3 (length=3), bytes "b@y"
//
// Total: 18 bytes. The same JSON is ~50 bytes.
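The byte dump above can be reproduced by hand. The sketch below hard-codes the User schema instead of using generated code, so encode_user and its helpers are hypothetical names, not part of any protobuf library.

```python
import json
from typing import List


def encode_varint(value: int) -> bytes:
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def varint_field(number: int, value: int) -> bytes:
    # wire type 0: tag, then the value as a varint
    return encode_varint((number << 3) | 0) + encode_varint(value)


def bytes_field(number: int, payload: bytes) -> bytes:
    # wire type 2: tag, varint length, then the raw bytes
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload


def encode_user(user_id: int, name: str, emails: List[str]) -> bytes:
    out = b""
    if user_id != 0:                 # proto3: default values are not transmitted
        out += varint_field(1, user_id)
    if name:
        out += bytes_field(2, name.encode("utf-8"))
    for email in emails:             # repeated string: one tagged entry per element
        out += bytes_field(3, email.encode("utf-8"))
    return out


wire = encode_user(300, "Ada", ["a@x", "b@y"])
as_json = json.dumps(
    {"id": 300, "name": "Ada", "emails": ["a@x", "b@y"]}, separators=(",", ":")
).encode("utf-8")
print(wire.hex(" "), len(wire), len(as_json))
```

Calling encode_user(0, "", []) returns an empty byte string, which is the default-value erasure discussed in the list that follows.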

A few wire-format consequences worth knowing:

Field numbers are forever. Reuse a number for a different type and old bytes are silently misinterpreted. The compiler only rejects duplicate numbers within the current version of a message; across versions it's the operator's job, and marking retired numbers as reserved lets the compiler help.
Unknown fields are preserved in proto3 (since 3.5). A new field added by the server flows through old clients untouched and back to a new server, enabling middle-out schema evolution.
Default values aren't transmitted. A proto3 int that is 0 emits zero bytes. This means "field absent" and "field set to default" are indistinguishable on the wire — use optional (re-added in proto3) when presence matters.
Repeated scalar fields default to packed in proto3: one length prefix followed by the concatenated values. A repeated int with a million elements is one tag, one length, then the varints. Massive savings.
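To make the packed-encoding saving concrete, the sketch below encodes the same repeated integer field both ways. The field number 4 and the helper names are arbitrary choices for illustration.

```python
def encode_varint(value: int) -> bytes:
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_unpacked(field_number: int, values) -> bytes:
    """One (tag, varint) pair per element, wire type 0."""
    tag = encode_varint((field_number << 3) | 0)
    return b"".join(tag + encode_varint(v) for v in values)


def encode_packed(field_number: int, values) -> bytes:
    """One tag (wire type 2), one length, then the concatenated varints."""
    payload = b"".join(encode_varint(v) for v in values)
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


print(encode_unpacked(4, [1, 2, 300]).hex(" "))   # 20 01 20 02 20 ac 02
print(encode_packed(4, [1, 2, 300]).hex(" "))     # 22 04 01 02 ac 02
```

For three elements the difference is a single byte; for a thousand one-byte values it is 1003 bytes against 2000, because the per-element tag disappears.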

HTTP/2 Framing: How an RPC Rides the Wire

Every gRPC call is exactly one HTTP/2 stream. The mapping is:

Initial HEADERS frame (request) — :method=POST, :scheme=https, :path=/package.Service/Method, :authority=host:port, content-type=application/grpc+proto, te=trailers, optional grpc-timeout=5S, optional grpc-encoding=gzip, plus user metadata (custom headers).
One or more DATA frames — the request body. Each gRPC message in the body is prefixed with a 5-byte header: 1 byte compression flag (0 or 1) plus 4 bytes big-endian message length. Multiple messages can be packed into one DATA frame, or a single message can be split across multiple DATA frames.
END_STREAM flag on the last DATA frame — closes the request half of the bidirectional stream.
HEADERS frame (response) — :status=200, content-type. (Note: HTTP status 200 even on logical errors. The actual gRPC status comes in the trailers.)
DATA frames with the response messages, same 5-byte prefix.
Trailing HEADERS frame — grpc-status (the integer status code, 0 = OK), optional grpc-message (UTF-8 description), optional grpc-status-details-bin (base64-encoded google.rpc.Status with structured details).

The "trailers" part is why gRPC requires HTTP/2: HTTP/1.1 trailers exist but are poorly supported, and you need a way to deliver a final status after the body — because streamed responses don't know they'll fail until late. The te: trailers header on the request is mandatory and signals that the client understands trailers.

One TCP+TLS connection can carry up to SETTINGS_MAX_CONCURRENT_STREAMS simultaneous calls (typically 100). Beyond that, the channel either queues new RPCs or opens a second connection, depending on the implementation. HTTP/2 flow control means the receiver controls how fast bytes flow on each stream and across the connection, independently of TCP-level backpressure.
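The 5-byte message prefix is simple enough to sketch directly. The code below frames messages and recovers them from a byte buffer that may end mid-message, which is what a receiver faces when a message is split across DATA frames. The function names are illustrative; real implementations also enforce the maximum message size at this point.

```python
import struct


def frame_message(payload: bytes, compressed: bool = False) -> bytes:
    """Prefix: 1-byte compression flag + 4-byte big-endian length."""
    return struct.pack(">BI", 1 if compressed else 0, len(payload)) + payload


def deframe(buffer: bytes):
    """Return (complete_messages, leftover_bytes).

    Each message is a (compressed, payload) tuple. Bytes belonging to a
    message that has not fully arrived are handed back as leftover.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= 5:
        flag, length = struct.unpack_from(">BI", buffer, offset)
        if len(buffer) - offset - 5 < length:
            break  # body not complete yet; wait for more DATA
        start = offset + 5
        messages.append((bool(flag), buffer[start:start + length]))
        offset = start + length
    return messages, buffer[offset:]


stream = frame_message(b"abc") + frame_message(b"de")
print(deframe(stream[:7]))   # ([], first 7 bytes): "abc" is still incomplete
print(deframe(stream))       # both messages, nothing left over
```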

The Four Streaming Modes

Defined by which side of the call sends a stream of messages vs a single message:

service Chat {
  rpc Send(Message) returns (Ack);                      // unary
  rpc Watch(Query) returns (stream Update);             // server streaming
  rpc Upload(stream Chunk) returns (UploadResult);      // client streaming
  rpc Sync(stream Event) returns (stream Event);        // bidirectional streaming
}

Unary: the REST analog. Client sends one message, server replies with one message. Each side sets END_STREAM right after its single message.
Server streaming: client sends one request, server sends N responses on the same stream and closes. Used for change feeds, progress updates, log tails. Cancellation propagates: if the client cancels the call, the server's context.Done() fires.
Client streaming: client streams N messages, then closes its half; server replies with one final message. Used for chunked uploads, batch insertion, accumulating state.
Bidirectional streaming: both sides send independent streams of messages on the same HTTP/2 stream. Ordering is preserved within each direction but not between the two directions. Used for chat-like protocols, two-way subscriptions, and synchronization.

All four use the same on-the-wire framing: a HEADERS, then DATA frames, then trailing HEADERS. The mode is a code-generation distinction, not a protocol distinction.

Deadlines and Cancellation Propagation

gRPC has no timeout in the HTTP sense; it has deadlines. A deadline is an absolute point in time by which the call must complete. The client sets it (e.g., 2s from now); the runtime sends the remaining time as a relative duration in the grpc-timeout header, so the two machines' clocks never need to agree, and the server rebuilds the deadline on its own clock and enforces it. When the deadline expires, each side's runtime synthesizes a DEADLINE_EXCEEDED status and cancels the stream.

The critical property is propagation. If service A calls B with a 2s deadline, and B internally calls C, B must forward the remaining deadline to C. Most language SDKs make this automatic: the request context carries the deadline, and a derived context for any sub-call inherits the remaining time. This is how an entire fan-out RPC tree dies cleanly when the user-visible deadline expires — instead of orphaning hundreds of background calls that nobody is waiting for.
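A small sketch of both halves: the grpc-timeout header value (an integer of at most 8 digits plus a unit letter, H, M, S, m, u, or n) and the remaining-time arithmetic that a sub-call inherits. The Deadline class and its method names are hypothetical, and the choice of the finest unit that fits in 8 digits is one reasonable strategy, not necessarily what every runtime does.

```python
import time

# unit suffix -> size in nanoseconds, finest first
_UNITS = [("n", 1), ("u", 1_000), ("m", 1_000_000),
          ("S", 1_000_000_000), ("M", 60_000_000_000), ("H", 3_600_000_000_000)]


def encode_grpc_timeout(seconds: float) -> str:
    """Render a duration using the finest unit whose value fits in 8 digits."""
    nanos = max(0, round(seconds * 1e9))
    for suffix, size in _UNITS:
        value = -(-nanos // size)          # round up so we never shorten the deadline
        if value < 100_000_000:
            return f"{value}{suffix}"
    raise ValueError("timeout too large for grpc-timeout")


def decode_grpc_timeout(header: str) -> float:
    return int(header[:-1]) * dict(_UNITS)[header[-1]] / 1e9


class Deadline:
    """Absolute expiry on the local clock; sub-calls inherit what is left."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def child_timeout(self, own_limit=None) -> float:
        left = self.remaining()
        return left if own_limit is None else min(left, own_limit)


# A calls B with 2s; B spends 0.5s, then calls C with whatever is left.
now = [100.0]                                   # fake clock for a repeatable demo
incoming = Deadline(decode_grpc_timeout("2S"), clock=lambda: now[0])
now[0] += 0.5
print(encode_grpc_timeout(incoming.child_timeout()))   # 1500000u
```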

Cancellation is the deadline's companion. A client that closes the call (user closed the browser tab, parent context cancelled) sends a RST_STREAM frame with error code CANCEL. The server's handler context fires immediately; the handler is expected to release resources and return. Long-running server work that doesn't honor cancellation is the most common gRPC anti-pattern — it leaks goroutines / threads / file descriptors on every cancelled call.

Channel State, Name Resolution, Load Balancing

A gRPC channel is the client-side abstraction over (potentially many) connections to (potentially many) backends. It runs a state machine:

IDLE — channel exists but no active connection. New RPCs trigger a transition to CONNECTING.
CONNECTING — opening a TCP+TLS connection.
READY — at least one connection is up; RPCs flow.
TRANSIENT_FAILURE — last attempt failed; back off and retry.
SHUTDOWN — channel closed by the application; reject new RPCs.
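The state list above can be written as a transition table. The event names here (rpc_started, connected, and so on) are invented for the sketch; real channels expose the states but drive the transitions internally.

```python
from enum import Enum


class ChannelState(Enum):
    IDLE = "IDLE"
    CONNECTING = "CONNECTING"
    READY = "READY"
    TRANSIENT_FAILURE = "TRANSIENT_FAILURE"
    SHUTDOWN = "SHUTDOWN"


# (current state, event) -> next state; event names are hypothetical
TRANSITIONS = {
    (ChannelState.IDLE, "rpc_started"): ChannelState.CONNECTING,
    (ChannelState.CONNECTING, "connected"): ChannelState.READY,
    (ChannelState.CONNECTING, "connect_failed"): ChannelState.TRANSIENT_FAILURE,
    (ChannelState.READY, "connection_lost"): ChannelState.IDLE,
    (ChannelState.READY, "idle_timeout"): ChannelState.IDLE,
    (ChannelState.TRANSIENT_FAILURE, "backoff_elapsed"): ChannelState.CONNECTING,
}


def next_state(state: ChannelState, event: str) -> ChannelState:
    if state is ChannelState.SHUTDOWN:
        return state                      # terminal: nothing leaves SHUTDOWN
    if event == "shutdown":
        return ChannelState.SHUTDOWN      # the application may close from any state
    return TRANSITIONS.get((state, event), state)   # unknown events are ignored


state = ChannelState.IDLE
for event in ["rpc_started", "connect_failed", "backoff_elapsed", "connected"]:
    state = next_state(state, event)
print(state.name)   # READY
```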

Behind the channel is the name resolver, selected by the target's URI scheme: dns:///orders.svc.cluster.local:50051, xds:///orders, unix:///var/run/svc.sock, or a custom scheme registered by the application. The resolver returns one or more endpoints (addresses and per-endpoint configuration) and a service config: JSON describing load balancing policy, retry policy, and method-level overrides.

The load balancer picks one endpoint per RPC. The standard policies:

pick_first: connect to the first address the resolver returns and reuse it for all calls. This is the default when no policy is configured. Fails over to the next address on failure.
round_robin: open a connection to every endpoint, rotate calls across them. Each backend sees roughly equal load.
xds: full Envoy-style discovery + LB policy from a control plane. Supports weighted clusters, locality awareness, percentage-based traffic splitting (canary, A/B), priority-based failover, and active+passive health checking.
grpclb (legacy): talk to a separate "look-aside" LB service that returns a list of backends; deprecated in favor of xDS.
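The difference between the two simple policies fits in a few lines. This is a toy model with hypothetical class names; real pickers operate on connected subchannels and react to connectivity changes rather than to an explicit report_failure call.

```python
import itertools


class PickFirst:
    """Stick to one address; move to the next only after a failure."""

    def __init__(self, addresses):
        self._addresses = list(addresses)
        self._current = 0

    def pick(self) -> str:
        return self._addresses[self._current]

    def report_failure(self) -> None:
        self._current = (self._current + 1) % len(self._addresses)


class RoundRobin:
    """Rotate across every address, one RPC at a time."""

    def __init__(self, addresses):
        self._cycle = itertools.cycle(list(addresses))

    def pick(self) -> str:
        return next(self._cycle)


backends = ["10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"]

pf = PickFirst(backends)
print([pf.pick() for _ in range(3)])     # same backend three times
pf.report_failure()
print(pf.pick())                         # now the second backend

rr = RoundRobin(backends)
print([rr.pick() for _ in range(4)])     # 1, 2, 3, then back to 1
```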

Service config is the JSON glue. A typical entry:

{
  "loadBalancingConfig": [{"round_robin": {}}],
  "methodConfig": [{
    "name": [{"service": "orders.OrderService"}],
    "timeout": "2s",
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2,
      "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED"]
    }
  }]
}
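To see what the retryPolicy numbers mean in practice, the sketch below parses the example config and computes the backoff ceiling before each retry: initialBackoff multiplied by backoffMultiplier per retry, capped at maxBackoff. The runtime then applies randomized jitter to these ceilings; the exact jitter rule is left out here because it is an implementation detail this sketch does not try to reproduce. The helper names are made up.

```python
import json

SERVICE_CONFIG = """
{
  "loadBalancingConfig": [{"round_robin": {}}],
  "methodConfig": [{
    "name": [{"service": "orders.OrderService"}],
    "timeout": "2s",
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2,
      "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED"]
    }
  }]
}
"""


def parse_duration(text: str) -> float:
    """Service config durations are decimal seconds with an 's' suffix."""
    return float(text.rstrip("s"))


def backoff_ceilings(policy: dict) -> list:
    """Upper bound on the wait before each retry (before jitter)."""
    initial = parse_duration(policy["initialBackoff"])
    cap = parse_duration(policy["maxBackoff"])
    multiplier = policy["backoffMultiplier"]
    retries = policy["maxAttempts"] - 1      # maxAttempts includes the first try
    return [min(initial * multiplier ** n, cap) for n in range(retries)]


config = json.loads(SERVICE_CONFIG)
policy = config["methodConfig"][0]["retryPolicy"]
print(backoff_ceilings(policy))   # [0.1, 0.2, 0.4]
```

With maxAttempts set to 4 there are at most three retries, and none of the ceilings reaches the 1s cap; a fifth attempt would wait up to 0.8s.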

Retries, Hedging, and gRFC A6

gRPC's retry behavior is specified in gRFC A6, the client retry design document. No retry policy applies by default: unlike HTTP libraries that often retry on connection refused, gRPC will not resend an RPC the server may already have seen unless you opt in via service config or programmatically.

Once enabled, the client retries on the configured status codes. UNAVAILABLE is the usual safe choice: it signals that the service could not take the call right now, most often because the request never reached a handler. For other codes, retrying may cause duplicate side effects, so the server is expected to make its handlers idempotent or expose an idempotency key in the request.

Two important gates:

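Per-call attempt limit: capped by maxAttempts, which counts the original attempt and may not exceed 5.
Retry throttling and pushback: the server can return grpc-retry-pushback-ms in its trailers to tell the client to wait that many milliseconds before retrying (a negative value means do not retry). Separately, the service config's retryThrottling section gives the client a token bucket per server: failures drain it, successes refill it, and retries are suppressed while the bucket is at or below half full.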
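The two gates can be sketched together. The token bucket below follows the retryThrottling scheme in gRFC A6 as I understand it: the count starts at maxTokens, each failed RPC subtracts one, each success adds tokenRatio, and retries are permitted only while the count is above half of maxTokens. The class and function names are illustrative.

```python
class RetryThrottle:
    """Client-side token bucket that suppresses retries when failures pile up."""

    def __init__(self, max_tokens: float, token_ratio: float):
        self.max_tokens = max_tokens
        self.token_ratio = token_ratio
        self.tokens = float(max_tokens)

    def on_failure(self) -> None:
        self.tokens = max(0.0, self.tokens - 1)

    def on_success(self) -> None:
        self.tokens = min(self.max_tokens, self.tokens + self.token_ratio)

    def retries_allowed(self) -> bool:
        return self.tokens > self.max_tokens / 2


def should_retry(status, attempts_made, policy, throttle, pushback_ms=None) -> bool:
    if pushback_ms is not None and pushback_ms < 0:
        return False                                   # server said: do not retry
    if status not in policy["retryableStatusCodes"]:
        return False
    if attempts_made >= min(policy["maxAttempts"], 5):
        return False                                   # attempt limit, hard cap of 5
    return throttle.retries_allowed()


policy = {"maxAttempts": 4, "retryableStatusCodes": ["UNAVAILABLE"]}
throttle = RetryThrottle(max_tokens=10, token_ratio=0.1)

print(should_retry("UNAVAILABLE", 1, policy, throttle))   # True
for _ in range(5):
    throttle.on_failure()                                 # bucket drops to 5 of 10
print(should_retry("UNAVAILABLE", 1, policy, throttle))   # False: at the threshold
```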

Hedging is the alternative for tail-latency reduction: send up to N copies of the same request (initial + N-1 hedges) staggered by hedgingDelay; take the first response; cancel the rest. Hedging is mutually exclusive with retries on the same call. It's particularly useful for read calls where one slow replica would otherwise stall the whole RPC.
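A minimal asyncio sketch of the hedging idea: start an attempt, wait hedging_delay, start another if nothing has come back, return the first result, and cancel the stragglers. The hedged_call function and the fake replicas are hypothetical, and the sketch ignores details a real client handles, such as non-fatal status codes that trigger the next hedge immediately and throttling.

```python
import asyncio


async def hedged_call(make_attempt, max_attempts: int, hedging_delay: float):
    """Launch up to max_attempts copies, staggered by hedging_delay seconds."""
    tasks = []
    try:
        for index in range(max_attempts):
            tasks.append(asyncio.create_task(make_attempt(index)))
            done, _ = await asyncio.wait(
                tasks, timeout=hedging_delay, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                return done.pop().result()
        # every copy is in flight; wait for whichever finishes first
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in tasks:
            if not task.done():
                task.cancel()            # cancel the losers


launched = []
LATENCIES = [0.2, 0.01, 0.01]            # replica 0 is the slow one


async def fake_replica(index: int) -> str:
    launched.append(index)
    await asyncio.sleep(LATENCIES[index])
    return f"replica-{index}"


result = asyncio.run(hedged_call(fake_replica, max_attempts=3, hedging_delay=0.05))
print(result, launched)                  # replica-1 [0, 1]
```

The third copy is never sent because the first hedge answers before the next delay elapses, which is the behavior that keeps hedging cheap when most replicas are healthy.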

Interceptors: Cross-Cutting Concerns

Interceptors are gRPC's middleware mechanism. A unary interceptor wraps a handler: it receives the incoming request and the next handler, and can do any of: log, authenticate, inject metadata, mutate context, time the call, short-circuit. Streaming interceptors wrap the stream object so they can observe each message.

Standard patterns:

Auth interceptor — extract bearer token from authorization metadata, verify it, attach claims to context. Reject with UNAUTHENTICATED on failure.
Logging / tracing interceptor — start a span, record method name and duration, propagate trace context (W3C traceparent or B3 headers in metadata).
Rate-limit interceptor — token-bucket per method or per caller.
Validation interceptor — invoke protoc-gen-validate-generated validators on the request before the handler runs.
Recovery interceptor (server-side) — convert panics/uncaught exceptions into INTERNAL status, prevent the worker from dying.

Interceptors compose like functional middleware: each wraps the next, building a chain from outside in. Order matters: put logging outermost so that rejections are logged too, then auth, then rate limiting, then the handler.
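The wrapping order is easiest to see in code. The sketch below models handlers as plain functions of (request, metadata) and interceptors as functions that take the next handler and return a new one. None of this is the grpc library's interceptor API; RpcError, chain, and the interceptors are hypothetical stand-ins.

```python
class RpcError(Exception):
    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def chain(handler, *interceptors):
    """Wrap handler so that the first interceptor listed is the outermost."""
    for interceptor in reversed(interceptors):
        handler = interceptor(handler)
    return handler


def logging_interceptor(log: list):
    def wrap(next_handler):
        def handler(request, metadata):
            try:
                response = next_handler(request, metadata)
                log.append("OK")
                return response
            except RpcError as err:
                log.append(err.code)     # sees rejections raised further in
                raise
        return handler
    return wrap


def auth_interceptor(next_handler):
    def handler(request, metadata):
        if metadata.get("authorization") != "Bearer secret":
            raise RpcError("UNAUTHENTICATED")     # short-circuit: handler never runs
        return next_handler(request, metadata)
    return handler


def shout(request, metadata):
    return request.upper()


log = []
rpc = chain(shout, logging_interceptor(log), auth_interceptor)

print(rpc("hi", {"authorization": "Bearer secret"}))   # HI
try:
    rpc("hi", {})
except RpcError as err:
    print("rejected:", err.code)
print(log)                                             # ['OK', 'UNAUTHENTICATED']
```

Because the logging interceptor sits outside the auth interceptor, the rejected call still leaves a log entry; swapping the two would make unauthenticated calls invisible to the log.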

Keepalive and Connection Health

A gRPC connection is persistent; an idle connection still costs file descriptors and kernel state, and intermediate NATs/load balancers may silently drop a TCP connection after some inactivity timeout (typically 60-300s on cloud LBs). gRPC keepalive sends periodic HTTP/2 PING frames to detect dead connections and to keep middleboxes from reaping the connection.

Knobs that matter:

keepalive_time: interval between PINGs. The default is 2 hours on servers, and clients in several implementations send no keepalive PINGs at all unless configured, so it is often tuned down to 30-60s.
keepalive_timeout: how long to wait for the PING acknowledgement before considering the connection dead (default 20s).
permit_without_stream: whether to PING when no RPCs are active. Without this, idle connections are reaped by middleboxes; with this, you keep them warm at the cost of a tiny background heartbeat.
min_ping_interval_without_stream (server-side): defends against clients that PING aggressively. Default 5 minutes; clients that ping faster are sent a GOAWAY.

An aggressive client paired with a permissive server detects a dead connection within roughly keepalive_time + keepalive_timeout and reconnects automatically via the channel state machine. The server's GOAWAY frame is the graceful equivalent: "finish your in-flight streams up to ID X, then this connection is closing." It is used during deploys and load shedding.
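As a configuration sketch, here is how these knobs look as channel arguments in C-core based libraries such as grpcio. The key strings are the C-core argument names as I know them; the values are illustrative, and the two helper functions are hypothetical sanity checks, not library features. The lists would be passed as the options argument when creating a channel or server (not executed here, to keep the example free of third-party imports).

```python
# Client side: ping every 30s, even when idle, and give up after 20s of silence.
CLIENT_KEEPALIVE = [
    ("grpc.keepalive_time_ms", 30_000),
    ("grpc.keepalive_timeout_ms", 20_000),
    ("grpc.keepalive_permit_without_calls", 1),
]

# Server side: tolerate pings as often as every 30s instead of the 5 minute default.
SERVER_KEEPALIVE = [
    ("grpc.http2.min_recv_ping_interval_without_data_ms", 30_000),
    ("grpc.keepalive_permit_without_calls", 1),
]


def worst_case_detection_s(client_options) -> float:
    """Upper bound on noticing a dead idle connection: one interval plus the timeout."""
    opts = dict(client_options)
    return (opts["grpc.keepalive_time_ms"] + opts["grpc.keepalive_timeout_ms"]) / 1000


def client_respects_server(client_options, server_options) -> bool:
    """True if the client never pings faster than the server is willing to accept."""
    client, server = dict(client_options), dict(server_options)
    return (client["grpc.keepalive_time_ms"]
            >= server["grpc.http2.min_recv_ping_interval_without_data_ms"])


print(worst_case_detection_s(CLIENT_KEEPALIVE))                       # 50.0
print(client_respects_server(CLIENT_KEEPALIVE, SERVER_KEEPALIVE))     # True
```

Lowering only the client interval while leaving the server at its default is the classic misconfiguration: the second check would fail, and in production the server would answer the extra pings with GOAWAY.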

gRPC-Web and Browser Limitations

Browsers cannot speak HTTP/2 frames directly from JavaScript — the Fetch API and XMLHttpRequest don't expose stream-level control or trailers. gRPC-Web is the workaround: a slightly different wire format that lets browsers call gRPC services via a proxy.

Two gRPC-Web modes:

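grpc-web-text (content type application/grpc-web-text): the framed body is base64-encoded so it survives text-only transports, with the trailers appended to the body as a final frame marked by a flag bit. Works with HTTP/1.1.
grpc-web (content type application/grpc-web+proto, binary): the same length-prefixed protobuf messages as native gRPC, again with the trailers appended to the body as a final frame. The browser-side library de-frames and surfaces the trailer status.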

Both modes are translated to native gRPC by a server-side proxy (Envoy's grpc_web filter is the canonical implementation). gRPC-Web supports unary and server-streaming RPCs. Client streaming and bidirectional streaming are not supported in any browser today — a fundamental limitation of Fetch's request body semantics.
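A sketch of the de-framing a gRPC-Web client library performs. It assumes the framing described in the gRPC-Web protocol document: the usual 5-byte prefix, with the most significant bit of the flag byte marking a trailer frame whose payload is an HTTP/1-style header block. The function name is made up, and the text-mode branch assumes the whole body is a single base64 string, which streaming responses do not guarantee.

```python
import base64
import struct


def parse_grpc_web_body(body: bytes, text_mode: bool = False):
    """Split a gRPC-Web response body into (messages, trailers)."""
    if text_mode:
        body = base64.b64decode(body)
    messages, trailers = [], {}
    offset = 0
    while offset + 5 <= len(body):
        flags, length = struct.unpack_from(">BI", body, offset)
        payload = body[offset + 5:offset + 5 + length]
        offset += 5 + length
        if flags & 0x80:                         # MSB set: this frame carries trailers
            for line in payload.decode("ascii").split("\r\n"):
                if line:
                    name, _, value = line.partition(":")
                    trailers[name.strip().lower()] = value.strip()
        else:
            messages.append(payload)
    return messages, trailers


trailer_block = b"grpc-status:0\r\ngrpc-message:ok\r\n"
binary_body = (
    b"\x00" + struct.pack(">I", 3) + b"abc"
    + b"\x80" + struct.pack(">I", len(trailer_block)) + trailer_block
)

print(parse_grpc_web_body(binary_body))
print(parse_grpc_web_body(base64.b64encode(binary_body), text_mode=True))
```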

gRPC vs Other RPC Styles

	gRPC	REST + JSON	GraphQL	Apache Thrift
Wire format	Protobuf (binary)	JSON (text)	JSON over HTTP	Binary (Compact / Binary protocol)
Transport	HTTP/2 only	HTTP/1.1, /2, /3	HTTP, WebSocket subscriptions	TCP, HTTP, custom
Schema	.proto, code-generated	OpenAPI optional	SDL, code-generated typed clients	.thrift, code-generated
Streaming	4 modes built-in	SSE, long-poll, WebSocket bolt-ons	Subscriptions (WebSocket)	Limited
Browser	gRPC-Web with proxy	Native	Native	Limited
Best at	Internal microservices, polyglot	Public APIs, simple integrations	Front-end driven aggregation	Facebook-era polyglot (legacy)
Schema evolution	Strong; field numbers	Conventional	Strong but explicit	Strong; field IDs
Tooling	protoc, buf, grpcurl, ghz, evans	curl, Postman, OpenAPI generators	GraphiQL, Apollo	thrift compiler

Tradeoffs and Honest Weaknesses

Browser story is awkward — gRPC-Web requires a translating proxy and doesn't support client/bidi streaming. For consumer-facing APIs, REST+JSON or GraphQL is still the right answer.
Debugging is harder than HTTP — protobuf isn't readable in tcpdump or Wireshark without a .proto-aware decoder. grpcurl, ghz, and evans exist precisely because curl doesn't speak gRPC.
HTTP/2 dependency — works fine in modern infrastructure but doesn't compose with HTTP/1-only intermediaries (some legacy load balancers, ancient proxies). HTTP/3 support is improving but uneven across language SDKs.
Proto3 default-value erasure — "field absent" and "field set to default" look the same on the wire. The optional keyword fixes it but requires schema discipline.
Default 4 MiB message limit — silent breakage when payloads grow. Tunable, but the default catches everyone exactly once.
Status code domain is small — 17 codes for everything. Real applications need richer error info, which means either grpc-status-details-bin (the structured google.rpc.Status proto) or stuffing details into the message. There's no analog of REST's "embed JSON error in 400 body" pattern that everyone already groks.
Streaming is harder than it looks — backpressure across language boundaries, half-close semantics, mid-stream cancellation, and resource leakage on dropped streams are subtle. Production gRPC streaming requires careful testing.

Frequently Asked Questions

Why does HTTP status always return 200 even for errors?

Because gRPC does not map errors to HTTP status codes. The transport layer (HTTP/2) succeeds even when the call logically fails. The actual outcome is in the grpc-status trailer: 0 = OK, 1 = CANCELLED, 2 = UNKNOWN, ..., 14 = UNAVAILABLE, 16 = UNAUTHENTICATED. This separation is necessary because streamed responses produce bytes successfully right up until the moment they fail; you cannot send the HTTP status header twice. Trailers are how gRPC reports outcome after the body.
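For reference, the full set of 17 status codes as an enum, with a tiny helper that reads the outcome from a trailer dictionary. The helper and the choice to treat a missing grpc-status as UNKNOWN are assumptions of this sketch.

```python
from enum import IntEnum


class StatusCode(IntEnum):
    OK = 0
    CANCELLED = 1
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    NOT_FOUND = 5
    ALREADY_EXISTS = 6
    PERMISSION_DENIED = 7
    RESOURCE_EXHAUSTED = 8
    FAILED_PRECONDITION = 9
    ABORTED = 10
    OUT_OF_RANGE = 11
    UNIMPLEMENTED = 12
    INTERNAL = 13
    UNAVAILABLE = 14
    DATA_LOSS = 15
    UNAUTHENTICATED = 16


def status_from_trailers(trailers: dict) -> StatusCode:
    """The HTTP status says the transport worked; the trailer says what happened."""
    raw = trailers.get("grpc-status")
    if raw is None:
        return StatusCode.UNKNOWN        # assumption: no trailer means we cannot tell
    return StatusCode(int(raw))


print(status_from_trailers({"grpc-status": "14", "grpc-message": "upstream down"}).name)
print(len(StatusCode))
```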

Should I use unary calls or server streaming for fetching a list?

Unary if the result fits in one message comfortably (sub-megabyte, low-thousands of items). Server streaming when the list is large enough that materializing it on the server is expensive, or when results arrive incrementally (search, log tail). Streaming has higher per-message overhead due to framing and backpressure machinery; for small results it's slower than unary.

What's the difference between Channel and ClientConn (in Go) and Stub?

The Channel/ClientConn is the connection-management layer — it owns the underlying HTTP/2 connections, name resolver, load balancer, and channel state machine. A single channel is meant to be created once per (target, credential) pair and shared across the application. The Stub is the type-safe wrapper generated from the .proto: it offers method calls like orderClient.GetOrder(ctx, req) and serializes them onto the channel. Stubs are cheap; channels are expensive.

What does grpc-encoding do, and should I enable gzip?

grpc-encoding in the request headers names the per-message compression codec (identity, gzip, and deflate are standard; others such as snappy depend on the implementation). Each message's 5-byte prefix carries a compression flag that is set when that message is compressed. The default is identity (no compression). Enable gzip for verbose payloads (large lists, long strings) where the CPU cost is paid back by network savings. For small chatty RPCs, gzip is net-negative. Many setups leave compression off entirely; TLS-level compression is disabled in modern stacks (because of the CRIME attack), so gzip at the gRPC layer is the only option, and it is worth enabling only when messages are large enough to repay the codec.
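The "net-negative for small messages" claim is easy to check with the standard gzip module. The sketch frames a payload both ways and keeps the smaller frame; choosing per message by size is a policy invented for this illustration (real libraries decide by configuration), although the wire format does allow an uncompressed message, flag 0, on a call whose grpc-encoding is gzip.

```python
import gzip
import struct


def frame(payload: bytes, use_gzip: bool) -> bytes:
    """5-byte prefix (compression flag + big-endian length), then the body."""
    body = gzip.compress(payload) if use_gzip else payload
    return struct.pack(">BI", 1 if use_gzip else 0, len(body)) + body


def frame_smallest(payload: bytes) -> bytes:
    """Hypothetical policy: send whichever framing is shorter."""
    return min(frame(payload, False), frame(payload, True), key=len)


small = b'{"ok":true}'
large = b"order-item," * 2000

for name, payload in [("small", small), ("large", large)]:
    plain, zipped = frame(payload, False), frame(payload, True)
    chosen = frame_smallest(payload)
    print(name, len(plain), len(zipped), "compressed" if chosen[0] else "identity")
```

The gzip container alone adds 18 bytes of header and trailer, so a payload of a dozen bytes always grows, while the repetitive 22 KB payload shrinks to a small fraction of its size.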

How do I version a gRPC API without breaking existing clients?

Two layers. Within a service: never reuse a field number, never change a field's type, mark deprecated fields as [deprecated = true], add new fields with new numbers. Across major versions: bump the package name (orders.v1 → orders.v2) so old and new live side by side. Servers can implement both, clients pick one. The HTTP/2 path naturally encodes the version: /orders.v1.OrderService/Get vs /orders.v2.OrderService/Get route to different handlers.

Why does my gRPC server occasionally return UNAVAILABLE on the first request after idle?

Usually the connection sat idle long enough that an intermediate NAT or load balancer dropped its state. That happens silently: neither endpoint receives a FIN or RST, so the client only discovers the dead connection when the next RPC fails, and it surfaces UNAVAILABLE. The fix is either to set keepalive_time below the middlebox idle threshold (permit_without_stream=true and keepalive_time=30s or so) or to rely on a retry policy with UNAVAILABLE as a retryable code, so the retry goes out on a fresh connection.

Is xDS for gRPC the same xDS Envoy uses?

Yes, that's the whole point. gRPC's xDS resolver speaks the same LDS / RDS / CDS / EDS protocol as Envoy, against the same control plane (Istio, Cilium Service Mesh, Google Cloud Service Mesh). This means a gRPC client without a sidecar can do much of what Envoy does: weighted clusters, locality-aware routing, traffic shifting, percentage canaries, fault injection. Feature coverage is a subset of Envoy's and varies by language and release. It's "proxyless service mesh": gRPC clients become first-class mesh participants, eliminating the sidecar's RAM cost and one network hop per request.