LLM Inference Serving
vLLM, TGI, TensorRT-LLM, SGLang: Continuous Batching and the Tricks That Make a 70B Model Cheap
Training is one-shot; inference runs forever. The economics of an LLM deployment hinge on how many tokens per second each GPU can sustain, and on whether the cost of every forward pass can be amortized across many requests. The naive answer (one request at a time per model copy) leaves 95%+ of the GPU's compute idle during decode. Modern serving stacks (vLLM, TGI, TensorRT-LLM, SGLang) all converge on the same handful of techniques: continuous batching, PagedAttention, and speculative decoding. Together these take a 70B model from unaffordable to a cost measured in cents per million tokens.
The Two-Phase Anatomy of LLM Inference
Every request has a compute-bound prefill and a memory-bound decode. They have wildly different bottlenecks.
Key Numbers
1. Static vs Dynamic vs Continuous Batching
Three generations of batching strategies. Continuous batching won.
| Strategy | Behavior | GPU util |
|---|---|---|
| Static batching | Pad all sequences to max length, decode in lockstep until longest finishes | ~10–30% |
| Dynamic batching | Batch requests that arrive within a small time window, otherwise like static | ~30–50% |
| Continuous batching (Orca, vLLM) | When any sequence emits EOS, evict it and admit a waiting request immediately, mid-batch | ~70–90% |
The key insight of continuous batching (Orca, OSDI 2022): the batch slot is the unit of scheduling, not the request. A finished sequence frees its slot at the next decode step, where a new request can take over. Iteration-level scheduling instead of request-level scheduling.
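To put numbers on the slot-level view, here is a toy simulation that counts decode steps under static and continuous batching. It is a sketch, not code from any serving stack: the function names and the request lengths are illustrative assumptions, and prefill cost, KV memory limits, and arrival times are all ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch decodes in lockstep until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled at the very next decode step."""
    waiting = deque(lengths)
    active = []  # tokens still to generate, one entry per occupied slot
    steps = 0
    while waiting or active:
        # Iteration-level scheduling: top up free slots before every step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active sequence emits one token; finished ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


def slot_utilization(lengths, batch_size, steps):
    """Fraction of (step, slot) pairs that produced a useful token."""
    return sum(lengths) / (steps * batch_size)


lengths = [10, 200, 15, 20, 180, 12, 30, 25]  # output tokens per request
for name, fn in [("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, batch_size=4)
    print(name, steps, round(slot_utilization(lengths, 4, steps), 3))
```

With these made-up lengths, static batching needs 380 decode steps and continuous batching needs 200, so slot utilization rises from about 32% to about 62%. The gap widens as output lengths become more skewed.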
# Continuous batching loop (simplified)
while True:
    # 1. Admit any waiting requests that fit in the free KV pages
    for req in list(queue):
        if free_kv_pages() >= req.estimated_pages:
            queue.remove(req)
            admit(req)

    # 2. One forward pass over the active batch
    next_tokens = model.decode_step(active_batch)

    # 3. Evict finished requests, free their KV pages.
    #    Walk backwards so evict(i) does not shift indices still to be visited.
    for i in reversed(range(len(next_tokens))):
        if next_tokens[i] == EOS or active_batch[i].len >= max_len:
            return_response(active_batch[i])
            evict(i)

2. PagedAttention (vLLM)
Borrow virtual memory paging from operating systems. Apply it to the KV cache.
The KV cache is the dominant memory cost. Naive allocation reserves max_seq_len contiguous slots per request. With max_seq_len = 8192 and average completion of ~200 tokens, you waste 97.5% of allocated KV memory.
PagedAttention (Kwon et al., SOSP 2023) solves this by storing the KV cache in fixed-size blocks (default 16 tokens), tracked by per-request page tables. New blocks are allocated on demand. Blocks can be shared across requests (prefix sharing for system prompts).
# Each request has a logical sequence of KV blocks.
# These map to physical blocks via a page table, like a tiny MMU.
class Sequence:
    block_table: list[int]  # logical → physical block id

# When a token is generated:
#   if last block has free slot: append in place
#   else: allocate new physical block, append to table
# When a request finishes:
#   release each physical block back to the pool
# Memory waste drops from ~75% to ~4%.

The cost: the attention kernel must do an indirect lookup through the page table. vLLM's CUDA kernel handles this in fewer cycles than the HBM read it gates, so net throughput improves.
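The bookkeeping described above fits in a few dozen lines. The following is a minimal sketch of on-demand block allocation, not vLLM's actual implementation: `BlockAllocator` and `PagedSequence` are hypothetical names, and the sketch tracks only block ids, not the KV tensors themselves.

```python
class BlockAllocator:
    """Hands out fixed-size physical KV blocks from a free list."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks; a real scheduler would preempt")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class PagedSequence:
    """Tracks one request's logical-to-physical block mapping."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is needed only when every allocated slot is full.
        if self.num_tokens == len(self.block_table) * self.allocator.block_size:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def wasted_fraction(self) -> float:
        reserved = len(self.block_table) * self.allocator.block_size
        return 0.0 if reserved == 0 else 1 - self.num_tokens / reserved

    def free(self) -> None:
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=64)
seq = PagedSequence(allocator)
for _ in range(200):  # a 200-token completion
    seq.append_token()
print(len(seq.block_table), round(seq.wasted_fraction(), 4))
```

A 200-token completion occupies 13 blocks (208 slots), so only the 8 unused slots in the last block are wasted, about 3.8%. A contiguous 8192-slot reservation for the same request would waste about 97.6%.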
3. Speculative Decoding
Use a small "draft" model to propose K tokens, verify them all in one big-model forward pass.
Decode is memory-bound: a single-token forward pass uses a tiny fraction of the GPU's compute. Speculative decoding (Leviathan et al., 2022; Chen et al., 2023) exploits this:
# 1. Cheap draft model (e.g., Llama-3.2 1B) proposes 4 tokens
draft_tokens = draft_model.generate(prefix, n=4)

# 2. Big model verifies all 4 in ONE forward pass
#    (parallel; uses the spare compute that decode wastes)
big_logits = big_model.forward(prefix + draft_tokens)

# 3. Accept the longest prefix where draft matches big-model sample.
#    The logits at position len(prefix) - 1 + i are the ones that predict draft token i.
accepted = []
for i, t in enumerate(draft_tokens):
    target_tok = sample(big_logits[len(prefix) - 1 + i])  # sample once per position
    accepted.append(target_tok)  # equals t on a match, the corrected token otherwise
    if target_tok != t:
        break
# Result: 1.5-3× speedup, identical output distribution.

Variants: Medusa attaches multiple decoding heads to the big model itself (no separate draft). EAGLE drafts at the feature level rather than tokens. Lookahead decoding searches a Jacobi trajectory without any draft model.
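The draft-then-verify loop can be exercised end to end with toy stand-ins for the two models. Everything here is an illustrative assumption: `target_next` and `draft_next` are made-up deterministic functions rather than neural networks, decoding is greedy, and the "one forward pass" of the big model is emulated with a list comprehension. The point is only to show that the speculative output matches plain greedy decoding token for token while calling the verifier far fewer times.

```python
def target_next(seq):
    """Toy 'big model': deterministic next token from the full context."""
    return (sum(seq) * 31 + len(seq)) % 50


def draft_next(seq):
    """Toy 'draft model': agrees with the target except at every 4th position."""
    tok = target_next(seq)
    return (tok + 1) % 50 if len(seq) % 4 == 0 else tok


def greedy_decode(prefix, n_new):
    seq = list(prefix)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(prefix, n_new, k=4):
    seq = list(prefix)
    target_len = len(prefix) + n_new
    verify_calls = 0
    while len(seq) < target_len:
        # 1. Draft proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. Target predicts the token at each of the k + 1 positions.
        #    A real model gets all of these from a single batched forward pass.
        verify_calls += 1
        target_preds = [target_next(seq + draft[:i]) for i in range(k + 1)]
        # 3. Keep draft tokens while they match, then append the target's own
        #    token: a correction on mismatch, a free bonus token otherwise.
        n_ok = 0
        while n_ok < k and draft[n_ok] == target_preds[n_ok]:
            n_ok += 1
        seq.extend(draft[:n_ok])
        seq.append(target_preds[n_ok])
    return seq[:target_len], verify_calls


prefix = [1, 2, 3]
spec_out, calls = speculative_decode(prefix, n_new=20)
assert spec_out == greedy_decode(prefix, 20)
print(f"{calls} verify passes for 20 tokens")
```

Every verify pass yields at least one token, so the worst case is no slower in pass count than ordinary decoding; the real-world overhead is the draft model's own runtime.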
Acceptance rate depends on draft quality. For a Llama-3 70B target, a 1B draft typically lands 60–70% acceptance, giving ~2× wall-clock speedup with no quality loss.
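The link between acceptance rate and speedup can be estimated with the closed-form expression from Leviathan et al. The sketch below assumes each draft token is accepted independently with probability `alpha`, which real workloads only approximate, and `draft_cost` (the draft's step time as a fraction of the target's) is a hypothetical value chosen for illustration.

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens emitted per big-model pass with k drafted tokens."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Wall-clock speedup over plain decoding.

    One round costs k draft steps plus one target step, measured in units
    of a single target step.
    """
    return expected_tokens_per_verify(alpha, k) / (k * draft_cost + 1)


for alpha in (0.4, 0.65, 0.7):
    print(alpha, round(expected_speedup(alpha, k=4, draft_cost=0.05), 2))
```

Under these assumptions, a 65–70% acceptance rate with four drafted tokens gives a bit over 2×, in line with the figure quoted above, while lower acceptance rates erode the gain quickly.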
4. KV Cache Layout
How the cache is shaped, sliced, and shared across attention heads.
Per layer, the KV cache stores (seq_len, num_kv_heads, head_dim) for both K and V. The shape choice matters because it determines memory bandwidth patterns:
# Standard MHA: num_kv_heads = num_q_heads
# GQA (Llama-3 70B): num_kv_heads = 8, num_q_heads = 64
# MQA (older): num_kv_heads = 1

# KV size for one layer, one request:
#   2 (K and V) * seq_len * num_kv_heads * head_dim * 2 bytes (fp16)

# Llama-3 8B example: 32 layers, 8 KV heads, 128 head_dim
#   per token: 2 * 32 * 8 * 128 * 2 = 131,072 bytes = 128 KB
#   8192-token context: 1 GB of KV cache PER REQUEST

# vLLM block-aligns this so each "block" holds 16 tokens
#   block_size_bytes = 16 * 128 KB = 2 MB (summed over all 32 layers)

Conceptually, vLLM stores the cache as [num_blocks, block_size, num_kv_heads, head_dim] for each of K and V, with the page table mapping logical positions to block ids; the exact in-memory layout varies by attention backend. The CUDA kernel reads through this indirection during the QKᵀ computation.
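The sizing arithmetic generalizes to any model config. This is a small helper sketch; the function names are made up for illustration, and the 64-KV-head variant is a hypothetical MHA counterpart used only to show what GQA saves.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token across all layers (K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_bytes_per_request(seq_len, **model):
    return seq_len * kv_bytes_per_token(**model)


llama3_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128)
llama3_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128)
# Hypothetical: the same 70B shape with full multi-head attention.
hypothetical_70b_mha = dict(num_layers=80, num_kv_heads=64, head_dim=128)

GIB = 1024 ** 3
for name, cfg in [("8B GQA", llama3_8b),
                  ("70B GQA", llama3_70b),
                  ("70B MHA (hypothetical)", hypothetical_70b_mha)]:
    print(name, kv_bytes_per_token(**cfg) // 1024, "KB/token,",
          kv_bytes_per_request(8192, **cfg) / GIB, "GiB at 8192 tokens")
```

The 8x reduction in KV heads translates directly into 8x more concurrent context for the same memory budget, which is why GQA matters as much for serving as for training.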
5. Stack Comparison: vLLM, TGI, TensorRT-LLM, SGLang
| Stack | Origin | Strengths | Weaknesses |
|---|---|---|---|
| vLLM | UC Berkeley, 2023 | PagedAttention reference, Python-friendly, broad model support, prefix caching | Not as tuned as TRT-LLM on H100; complex multi-LoRA |
| TGI (Text Generation Inference) | HuggingFace, 2022 | Production-grade HTTP server, tight HF ecosystem integration | Slower than vLLM in throughput benchmarks; less active |
| TensorRT-LLM | NVIDIA | Best raw H100/H200 throughput, fp8 first-class, kernel fusion | NVIDIA-only, ahead-of-time compile, harder DX |
| SGLang | UC Berkeley/CMU, 2024 | RadixAttention prefix sharing, fastest structured generation, JSON-mode kernels | Newer, less battle-tested in prod |
For most teams: vLLM is the default. TensorRT-LLM if you've squeezed vLLM and still need more throughput on H100s. SGLang if your workload is heavy on structured output (JSON, function calling) or shared prefixes (RAG, system prompts).
6. Prefix Caching
If two requests share a prefix (e.g., the same system prompt), the prefill computation for that prefix is identical. Cache it.
# Hash the token sequence → KV blocks that contain it
prefix_hash = hash(tuple(tokens[:1024]))  # lists are unhashable, so hash a tuple
if prefix_hash in cache:
    # Reuse cached KV blocks: no prefill compute needed for the prefix
    block_table = cache[prefix_hash].copy()
else:
    # Standard prefill, then memoize
    block_table = compute_kv(tokens[:1024])
    cache[prefix_hash] = block_table.copy()
# Either way, only the suffix still needs prefill
process_only_suffix(tokens[1024:])

vLLM's --enable-prefix-caching flag makes this automatic. SGLang's RadixAttention does it via a radix tree keyed on token sequences, sharing across all in-flight requests. For RAG workloads whose long retrieved contexts repeat across requests, this is often the single biggest cost saver, with throughput improvements of 3–5×.
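Hashing one fixed 1024-token prefix only helps when requests share exactly that span. A common refinement is to hash per KV block and chain each hash to its parent, so any shared leading run of full blocks is reused. The sketch below illustrates that idea under stated assumptions: `PrefixCache` and `block_hashes` are hypothetical names, there is no eviction, and hash collisions are ignored.

```python
BLOCK_SIZE = 16


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """One chained hash per FULL block; each hash covers the whole prefix so far."""
    hashes = []
    parent = 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        parent = hash((parent, tuple(tokens[start:start + block_size])))
        hashes.append(parent)
    return hashes


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # chained hash -> physical block id
        self.next_block_id = 0

    def lookup_or_insert(self, tokens):
        """Return (block_table, number of blocks reused from the cache)."""
        table, reused = [], 0
        for h in block_hashes(tokens):
            if h in self.blocks:
                # Chaining guarantees hits always form a leading run.
                reused += 1
            else:
                self.blocks[h] = self.next_block_id
                self.next_block_id += 1
            table.append(self.blocks[h])
        return table, reused


cache = PrefixCache()
system_prompt = list(range(32))                      # 2 full blocks
req_a = system_prompt + [900 + i for i in range(20)]
req_b = system_prompt + [500 + i for i in range(20)]
table_a, reused_a = cache.lookup_or_insert(req_a)
table_b, reused_b = cache.lookup_or_insert(req_b)
print(reused_a, reused_b, table_a, table_b)
```

The second request reuses the two blocks that hold the shared system prompt and allocates a fresh block only where its tokens diverge. Tokens past the last full block are not cached in this sketch.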
Tradeoffs
| Optimization | Win | Cost |
|---|---|---|
| Continuous batching | 5–10× throughput | Tail latency variance from preemption |
| PagedAttention | 4× more concurrent requests | Kernel complexity, indirect KV reads |
| Speculative decoding | 1.5–3× lower latency | Hosting a second (draft) model |
| Prefix caching | 3–5× on RAG workloads | Memory pressure; cache eviction policy |
| fp8 / int8 quant | 1.5–2× throughput, half memory | ~1% quality loss, calibration data needed |
FAQ
Why is decode memory-bound but prefill compute-bound?
Prefill processes hundreds of tokens at once — the GEMMs are large enough to saturate the GPU's TFLOPs. Decode processes one token at a time per request, so each matmul is tiny; the cost is dominated by reading model weights and KV cache from HBM.
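A back-of-the-envelope roofline check makes the split concrete. The hardware figures below are approximate H100 SXM numbers used as assumptions, and the model is deliberately crude: it counts only weight reads and the 2 FLOPs per parameter per token of the matmuls, ignoring KV cache traffic and attention FLOPs.

```python
# Approximate H100 SXM figures, treated as assumptions for illustration.
PEAK_FLOPS = 989e12      # dense fp16/bf16 tensor-core FLOP/s
HBM_BANDWIDTH = 3.35e12  # bytes/s

# FLOPs the GPU can perform in the time it takes to read one byte from HBM.
RIDGE_POINT = PEAK_FLOPS / HBM_BANDWIDTH


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights read when one pass processes this many tokens."""
    return 2 * tokens_per_pass / bytes_per_param


def is_compute_bound(tokens_per_pass: int) -> bool:
    return arithmetic_intensity(tokens_per_pass) >= RIDGE_POINT


for tokens in (1, 32, 512):
    kind = "compute-bound" if is_compute_bound(tokens) else "memory-bound"
    print(tokens, arithmetic_intensity(tokens), kind)
```

Under this model, a forward pass needs roughly 300 tokens in flight before fp16 matmuls stop waiting on HBM. A 512-token prefill clears that bar on its own; single-token decode only gets there by batching many requests together.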
Does continuous batching hurt latency?
Throughput goes up; per-token latency stays roughly the same. Time-to-first-token can suffer if the queue is deep and your request gets queued behind a long prefill, but most schedulers prioritize prefill of waiting requests over decode steps to keep TTFT low.
When is speculative decoding NOT worth it?
When draft acceptance rate is below ~40% — you pay for draft inference and reject most tokens. Also bad for very small targets (1B–3B) where the draft cost dominates. Best for 70B+ targets with a competent ~1B draft.
What's chunked prefill?
Splitting a long prefill into chunks of (say) 512 tokens, interleaved with decode steps for other requests. Prevents one long prompt from blocking all decode tokens behind it. vLLM and TRT-LLM both support this.
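The scheduling rule can be sketched as a per-step token budget: decode tokens are served first, and whatever budget remains goes to the next slice of the pending prompt. The function name, the 512-token budget, and the constant number of decoding sequences are assumptions for illustration, not any engine's actual scheduler.

```python
def plan_chunked_prefill(prompt_len, num_decoding, token_budget=512):
    """Return the prefill chunk size used at each engine step.

    Each step reserves one token per decoding sequence, then spends the
    rest of the budget on the pending prompt.
    """
    chunk_capacity = token_budget - num_decoding
    if chunk_capacity <= 0:
        raise ValueError("token budget is fully consumed by decode tokens")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, chunk_capacity)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


chunks = plan_chunked_prefill(prompt_len=4000, num_decoding=64)
print(len(chunks), chunks[0], chunks[-1])
```

Here a 4000-token prompt is spread over 9 steps, and the 64 running requests each emit a token at every one of them instead of stalling for the whole prefill. The new request's own TTFT grows somewhat in exchange for smoother inter-token latency for everyone else.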
Why is fp8 the new standard for inference?
H100 offers roughly 2× the fp8 tensor-core throughput of fp16, and fp8 weights and KV cache take half the bytes, so each token needs half the memory bandwidth. Combined with per-tensor scaling factors, accuracy loss vs fp16 is usually under 1%. TensorRT-LLM, vLLM, and SGLang all support fp8 weights and activations natively now.
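Per-tensor scaling is easiest to see on an integer grid. The sketch below uses symmetric int8 rather than fp8, because the Python standard library has no fp8 type; this is a swapped-in stand-in, and real fp8 rounds to a floating-point grid instead. The idea of a single scale per tensor, chosen from the largest magnitude, is the same. Function names are illustrative.

```python
def quantize_per_tensor(values, qmax=127):
    """Symmetric per-tensor quantization: one scale shared by every element."""
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor; any scale works
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.5, 0.333, -0.9]
q, scale = quantize_per_tensor(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Round-to-nearest bounds the error of every element by half a quantization step, so a single outlier weight inflates the scale and costs precision everywhere else. That is the reason calibration data and finer-grained (per-channel or per-block) scales are common refinements.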
How do I tune for cost vs latency?
Throughput-optimized (cost): max batch size, large block sizes, allow more queueing. Latency-optimized (TTFT): smaller batches, prefix cache aggressively, speculative decoding on. Most production stacks let you pick a profile per replica.