LLM Inference Serving

vLLM, TGI, TensorRT-LLM, SGLang: Continuous Batching and the Tricks That Make a 70B Model Cheap

Training is one-shot; inference runs forever. The economics of an LLM deployment hinge on how many tokens per second per GPU you can sustain, and on whether the fixed cost of each forward pass can be amortized across many requests. The naive answer (one request, one model copy) wastes 95%+ of GPU time. Modern serving stacks (vLLM, TGI, TensorRT-LLM, SGLang) all converge on the same handful of techniques: continuous batching, PagedAttention, and speculative decoding. Together these turn a 70B model from unaffordable to a few cents per million tokens.

The Two-Phase Anatomy of LLM Inference

Every request has a compute-bound prefill and a memory-bound decode. They have wildly different bottlenecks.

Prefill (parallel): process all input tokens at once. Compute-bound, GEMM-heavy. ~10 ms for 1k tokens on an A100.
Decode (sequential): one token at a time, autoregressive. Memory-bound, dominated by KV-cache I/O. ~20 ms/token.
Bottleneck: prefill is limited by GPU compute (TFLOPS); decode is limited by HBM bandwidth (TB/s). Batching helps decode disproportionately.

Key Numbers

~24×: peak vLLM throughput vs naive HF Transformers (vLLM launch blog post; the PagedAttention paper reports 2–4× over FasterTransformer and Orca)
60–80%: KV-cache memory wasted by contiguous preallocation in earlier systems (the problem PagedAttention solves)
2–3×: speculative decoding speedup (draft model + 4-token verify)
128 KB: KV cache per token, Llama-3 8B (fp16, all 32 layers)
16 tokens: vLLM default block size, the minimum KV allocation unit
2.0 TB/s: A100 (80 GB) HBM bandwidth, the decode bottleneck
~80%: GPU utilization with continuous batching, vs ~10% with static batching

1. Static vs Dynamic vs Continuous Batching

Three generations of batching strategies. Continuous batching won.

Strategy | Behavior | GPU util
Static batching | Pad all sequences to max length, decode in lockstep until longest finishes | ~10–30%
Dynamic batching | Batch requests that arrive within a small time window, otherwise like static | ~30–50%
Continuous batching (Orca, vLLM) | When any sequence emits EOS, evict it and admit a waiting request immediately, mid-batch | ~70–90%

The key insight of continuous batching (Orca, OSDI 2022): the batch slot is the unit of scheduling, not the request. A finished sequence frees its slot at the next decode step, where a new request can take over. Iteration-level scheduling instead of request-level scheduling.

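# Continuous batching loop (simplified; helper names are illustrative)
while True:
    # 1. Admit waiting requests while a batch slot and enough KV pages are free
    while queue and len(active_batch) < max_batch_size:
        if free_kv_pages() < queue[0].estimated_pages:
            break
        admit(queue.popleft())

    # 2. One forward pass over the active batch: one new token per sequence
    next_tokens = model.decode_step(active_batch)

    # 3. Collect finished sequences first, then evict, so indices stay valid
    finished = [
        seq for seq, tok in zip(active_batch, next_tokens)
        if tok == EOS or seq.len >= max_len
    ]
    for seq in finished:
        return_response(seq)
        evict(seq)   # frees the batch slot and the sequence's KV pages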
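To see why iteration-level scheduling matters, here is a toy simulation that counts decode steps under both policies. It is a sketch under simplifying assumptions: every sequence's output length is known up front, prefill is free, and KV memory is unlimited. The function names are hypothetical and do not correspond to any serving stack's API.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch decodes in lockstep until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A slot freed by a finished sequence is refilled at the next decode step."""
    waiting = deque(lengths)
    active = []   # tokens still to generate, one entry per occupied slot
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active sequence emits one token; drop those that just finished
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]   # output tokens per request
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)   # 200 110
```

With two long requests mixed among short ones, the static schedule spends most of its steps with three of four slots idle; the continuous schedule overlaps the two long requests and finishes the same work in roughly half the steps.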

2. PagedAttention (vLLM)

Borrow virtual memory paging from operating systems. Apply it to the KV cache.

The KV cache is the dominant memory cost. Naive allocation reserves max_seq_len contiguous slots per request. With max_seq_len = 8192 and an average completion of ~200 tokens, roughly 97.5% of the reserved KV memory is never written (200 / 8192 ≈ 2.4% used, ignoring the prompt).

PagedAttention (Kwon et al., SOSP 2023) solves this by storing the KV cache in fixed-size blocks (default 16 tokens), tracked by per-request page tables. New blocks are allocated on demand. Blocks can be shared across requests (prefix sharing for system prompts).

# Each request has a logical sequence of KV blocks
# These map to physical blocks via a page table — like a tiny MMU.

class Sequence:
    block_table: list[int]    # logical → physical block id

# When a token is generated:
#   if last block has free slot: append in place
#   else:                       allocate new physical block, append to table

# When a request finishes:
#   release each physical block back to the pool

# Reported KV memory waste drops from roughly 60-80% to under 4% (Kwon et al.).

The cost: the attention kernel must do an indirect lookup through the page table, which makes the kernel itself somewhat slower than one reading a contiguous cache. The memory saved allows much larger batches, so net throughput still improves.
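The allocation scheme described in this section can be sketched in a few lines. This is a minimal illustration, not vLLM's implementation: `BlockAllocator` and its methods are hypothetical names, there is no block sharing or reference counting, and only the bookkeeping is modeled (no tensors).

```python
class BlockAllocator:
    """Hypothetical fixed-size KV block pool with per-sequence block tables."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))        # free physical block ids
        self.tables: dict[str, list[int]] = {}     # seq id -> physical block ids
        self.lengths: dict[str, int] = {}          # seq id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n == len(table) * self.block_size:      # last block is full (or none yet)
            if not self.free:
                raise MemoryError("no free KV blocks; caller must queue or preempt")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots (only the tail of each last block)."""
        return sum(len(t) * self.block_size - self.lengths[s]
                   for s, t in self.tables.items())

alloc = BlockAllocator(num_blocks=64)
for _ in range(200):
    alloc.append_token("req-a")
blocks_used = len(alloc.tables["req-a"])
waste = alloc.wasted_slots()
print(blocks_used, waste)   # 13 8
```

A 200-token sequence occupies 13 blocks and wastes 8 slots, compared with 7,992 unused slots if 8192 had been reserved contiguously. Waste is bounded by one partially filled block per sequence.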

3. Speculative Decoding

Use a small "draft" model to propose K tokens, verify them all in one big-model forward pass.

Decode is memory-bound: a single-token forward pass uses a tiny fraction of the GPU's compute. Speculative decoding (Leviathan et al., 2022; Chen et al., 2023) exploits this:

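# 1. Cheap draft model (e.g., Llama-3.2 1B) proposes K = 4 tokens
draft_tokens = draft_model.generate(prefix, n=4)

# 2. Big model scores all 4 positions in ONE forward pass
# (parallel; uses the spare compute that decode wastes)
big_logits = big_model.forward(prefix + draft_tokens)

# 3. Accept the longest prefix where the draft matches the big model.
# big_logits[j] predicts the token at position j + 1, so the prediction
# for draft_tokens[i] lives at index len(prefix) - 1 + i.
accepted = []
for i, t in enumerate(draft_tokens):
    target = sample(big_logits[len(prefix) - 1 + i])
    accepted.append(target)      # equals t on a match, the corrected token otherwise
    if target != t:
        break
else:
    # All drafts accepted: the same pass also yields one bonus token
    accepted.append(sample(big_logits[-1]))

# Result: 1.5-3× speedup. Exact matching as shown reproduces greedy decoding;
# with sampling, the papers' accept/reject rule (accept a draft token with
# probability min(1, p_target / p_draft)) keeps the output distribution identical.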
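The "identical output distribution" property under sampling comes from the accept/reject rule in the speculative decoding papers: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resample from the normalized residual max(0, p - q). The sketch below applies that rule to a single position with toy three-token distributions; `speculative_accept` is a hypothetical helper name, and the distributions are made up for illustration.

```python
import random

def speculative_accept(p_target, q_draft, draft_token, rng=random):
    """Return (accepted, token) for one drafted position.

    p_target, q_draft: probability lists over the same vocabulary.
    draft_token must have been sampled from q_draft, so q_draft[draft_token] > 0.
    """
    p, q = p_target[draft_token], q_draft[draft_token]
    if rng.random() < min(1.0, p / q):
        return True, draft_token
    # Rejected: resample from the part of p that q under-covers
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, token

# Empirical check: the emitted tokens should follow p, not q
rng = random.Random(0)
p = [0.6, 0.3, 0.1]   # target model (toy values)
q = [0.2, 0.5, 0.3]   # draft model (toy values)
trials = 20000
counts = [0, 0, 0]
for _ in range(trials):
    drafted = rng.choices(range(3), weights=q, k=1)[0]
    _, tok = speculative_accept(p, q, drafted, rng)
    counts[tok] += 1
freqs = [c / trials for c in counts]
print(freqs)   # close to [0.6, 0.3, 0.1]
```

Even though the draft distribution is badly mismatched here, the emitted frequencies track the target distribution; a poor draft costs speed (more rejections), not quality.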

Variants: Medusa attaches multiple decoding heads to the big model itself (no separate draft). EAGLE drafts at the feature level rather than tokens. Lookahead decoding searches a Jacobi trajectory without any draft model.

Acceptance rate depends on draft quality. For a Llama-3 70B target, a 1B draft typically lands 60–70% acceptance, giving ~2× wall-clock speedup with no quality loss.
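The link between acceptance rate and speedup can be made concrete with the expected-value formula from Leviathan et al.: with K drafted tokens and a per-token acceptance rate α (assumed independent across positions), one target forward pass yields (1 - α^(K+1)) / (1 - α) tokens on average. The sketch below evaluates it; the draft-to-target cost ratio of 0.05 is an assumed illustrative value, not a measurement.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding; draft_cost_ratio = draft step time / target step time."""
    return expected_tokens_per_step(alpha, k) / (k * draft_cost_ratio + 1)

for alpha in (0.4, 0.65, 0.8):
    tokens = expected_tokens_per_step(alpha, k=4)
    speedup = expected_speedup(alpha, k=4, draft_cost_ratio=0.05)
    print(f"alpha={alpha:.2f}  tokens/step={tokens:.2f}  speedup={speedup:.2f}x")
```

At α = 0.65 and K = 4 this gives about 2.5 tokens per verification pass and a bit over 2× speedup once the assumed draft cost is included, which is in line with the figure quoted above.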

4. KV Cache Layout

How the cache is shaped, sliced, and shared across attention heads.

Per layer, the KV cache stores (seq_len, num_kv_heads, head_dim) for both K and V. The shape choice matters because it determines memory bandwidth patterns:

# Standard MHA:       num_kv_heads = num_q_heads
# GQA (Llama-3 70B):  num_kv_heads = 8, num_q_heads = 64
# GQA (Llama-3 8B):   num_kv_heads = 8, num_q_heads = 32
# MQA (older):        num_kv_heads = 1

# KV size for one layer, one request:
#   2 (K and V) * seq_len * num_kv_heads * head_dim * 2 bytes (fp16)

# Llama-3 8B example: 32 layers, 8 KV heads, 128 head_dim
#   per token: 2 * 32 * 8 * 128 * 2 = 131,072 bytes = 128 KB
#   8192-token context: 1 GB of KV cache PER REQUEST

# vLLM block-aligns this so each "block" holds 16 tokens
#   block_size_bytes = 16 * 128 KB = 2 MB

Conceptually, vLLM stores the cache as [num_blocks, block_size, num_kv_heads, head_dim] for each of K and V, with the page table mapping logical positions to block ids (the actual kernels may arrange the dimensions within a block differently for vectorized loads). The CUDA kernel reads through this indirection during the QKᵀ computation.
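The sizing arithmetic above is easy to wrap in a helper for other model shapes. This is a plain calculator, with `kv_bytes_per_token` a hypothetical name; the Llama-3 8B shape is the one given in this section, and the 32-KV-head variant is a what-if showing the cost of full multi-head attention for the same model.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: K and V, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token // 1024, "KB per token")             # 128 KB per token
print(per_token * 8192 / 2**30, "GB at 8192 tokens") # 1.0 GB at 8192 tokens
print(per_token * 16 / 2**20, "MB per 16-token block")  # 2.0 MB per 16-token block

# What-if: the same model with full MHA (32 KV heads) needs 4x the cache
mha_per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(mha_per_token // per_token, "x larger with MHA")  # 4 x larger with MHA
```

The 4× gap is why grouped-query attention matters for serving: the same HBM budget holds four times as many tokens of context.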

5. Stack Comparison: vLLM, TGI, TensorRT-LLM, SGLang

Stack | Origin | Strengths | Weaknesses
vLLM | UC Berkeley, 2023 | PagedAttention reference, Python-friendly, broad model support, prefix caching | Not as tuned as TRT-LLM on H100; complex multi-LoRA
TGI (Text Generation Inference) | Hugging Face, 2022 | Production-grade HTTP server, tight HF ecosystem integration | Slower than vLLM in throughput benchmarks; less active
TensorRT-LLM | NVIDIA | Best raw H100/H200 throughput, fp8 first-class, kernel fusion | NVIDIA-only, ahead-of-time compile, harder DX
SGLang | UC Berkeley/CMU, 2024 | RadixAttention prefix sharing, fastest structured generation, JSON-mode kernels | Newer, less battle-tested in prod

For most teams: vLLM is the default. TensorRT-LLM if you've squeezed vLLM and need the last 2× on H100s. SGLang if your workload is heavy on structured output (JSON, function calling) or shared prefixes (RAG, system prompts).

6. Prefix Caching

If two requests share a prefix (e.g., the same system prompt), the prefill computation for that prefix is identical. Cache it.

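# Key the cache on the shared prefix tokens (a tuple, so it is hashable)
prefix_len = 1024                      # illustrative; real systems hash per KV block
prefix_key = hash(tuple(tokens[:prefix_len]))

if prefix_key in cache:
    # Reuse cached KV blocks: no prefill compute for the prefix
    block_table = list(cache[prefix_key])
else:
    # Standard prefill of the prefix, then memoize the resulting blocks
    block_table = compute_kv(tokens[:prefix_len])
    cache[prefix_key] = list(block_table)

# Either way, only the suffix still needs a forward pass
process_only_suffix(tokens[prefix_len:])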

vLLM's --enable-prefix-caching flag makes this automatic. SGLang's RadixAttention does it via a radix tree keyed on token sequences, sharing KV blocks across all in-flight requests. For RAG workloads with long retrieved contexts, this is the single biggest cost saver, often a 3–5× throughput improvement.
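A fixed-length prefix key only helps when requests share exactly that many tokens. One way to generalize, sketched here, is to hash per KV block and chain each block's hash with everything before it, so two requests share blocks for as long as their tokens agree. This is an illustrative design under stated assumptions, not the code of vLLM or SGLang: `PrefixCache` and `block_hashes` are hypothetical names, there is no eviction, and only full blocks are cached.

```python
import hashlib

BLOCK = 16   # tokens per KV block

def block_hashes(tokens, block_size=BLOCK):
    """Hash each full block together with the hash of all preceding blocks."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block_size
    for i in range(0, full, block_size):
        chunk = ",".join(map(str, tokens[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = {}    # chained hash -> physical block id
        self.next_id = 0

    def lookup_or_insert(self, tokens):
        """Return (block_table, tokens_served_from_cache)."""
        table, hit_tokens = [], 0
        for h in block_hashes(tokens):
            if h in self.blocks:
                hit_tokens += BLOCK          # KV already computed: skip prefill
            else:
                self.blocks[h] = self.next_id  # would be filled by prefill
                self.next_id += 1
            table.append(self.blocks[h])
        return table, hit_tokens

cache = PrefixCache()
system_prompt = list(range(64))                  # 4 full blocks, shared
req_a = system_prompt + list(range(1000, 1020))
req_b = system_prompt + list(range(2000, 2020))
table_a, hits_a = cache.lookup_or_insert(req_a)
table_b, hits_b = cache.lookup_or_insert(req_b)
print(table_a, hits_a)   # [0, 1, 2, 3, 4] 0
print(table_b, hits_b)   # [0, 1, 2, 3, 5] 64
```

Because each hash covers the whole history, a block can only hit if every earlier block also matched, so cache hits always form a contiguous prefix. The second request reuses the four system-prompt blocks and only prefills its own suffix.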

Tradeoffs

Optimization | Win | Cost
Continuous batching | 5–10× throughput | Tail latency variance from preemption
PagedAttention | 4× more concurrent requests | Kernel complexity, indirect KV reads
Speculative decoding | 1.5–3× latency | Hosting a second (draft) model
Prefix caching | 3–5× on RAG workloads | Memory pressure; cache eviction policy
fp8 / int8 quant | 1.5–2× throughput, half memory | ~1% quality loss, calibration data needed

FAQ

Why is decode memory-bound but prefill compute-bound?

Prefill processes hundreds of tokens at once — the GEMMs are large enough to saturate the GPU's TFLOPs. Decode processes one token at a time per request, so each matmul is tiny; the cost is dominated by reading model weights and KV cache from HBM.
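A back-of-the-envelope calculation makes the decode side concrete. Each decode step must stream every weight from HBM once, regardless of batch size, so weight bytes divided by bandwidth is a floor on step time. The sketch assumes an 8B-parameter model in fp16 and the ~2.0 TB/s A100 bandwidth quoted in this article; it ignores KV-cache reads and kernel overhead, and `decode_step_floor_ms` is a hypothetical helper name.

```python
def decode_step_floor_ms(param_count: float, bytes_per_param: int,
                         hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: all weights are read from HBM once."""
    return param_count * bytes_per_param / hbm_bytes_per_s * 1e3

step_ms = decode_step_floor_ms(param_count=8e9, bytes_per_param=2,
                               hbm_bytes_per_s=2.0e12)

# The weight read is shared by every sequence in the batch,
# so the per-token share of that cost shrinks as the batch grows.
for batch in (1, 8, 64):
    print(f"batch={batch:3d}  step >= {step_ms:.1f} ms  "
          f"weight-read cost per token >= {step_ms / batch:.3f} ms")
```

At batch size 1 the GPU spends about 8 ms moving 16 GB of weights to produce a single token; at batch size 64 the same read produces 64 tokens. This is the amortization that batching buys during decode.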

Does continuous batching hurt latency?

Throughput goes up; per-token latency stays roughly the same. Time-to-first-token can suffer if the queue is deep and your request waits behind a long prefill. Many schedulers prioritize prefill of waiting requests over decode steps to keep TTFT low, at the price of briefly stalling sequences that are already decoding.

When is speculative decoding NOT worth it?

When draft acceptance rate is below ~40% — you pay for draft inference and reject most tokens. Also bad for very small targets (1B–3B) where the draft cost dominates. Best for 70B+ targets with a competent ~1B draft.

What's chunked prefill?

Splitting a long prefill into chunks of (say) 512 tokens, interleaved with decode steps for other requests. Prevents one long prompt from blocking all decode tokens behind it. vLLM and TRT-LLM both support this.
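A sketch of how one scheduling step might divide a fixed token budget between decodes and prefill chunks follows. The decode-first policy and the 512-token budget are assumptions for illustration, and `schedule_step` is a hypothetical function, not the scheduler of vLLM or TRT-LLM.

```python
def schedule_step(decoding, prefilling, token_budget=512):
    """Pick the work for one forward pass.

    decoding:   list of request ids currently generating (1 token each).
    prefilling: dict of request id -> prompt tokens still to prefill (mutated).
    Returns (decode_ids, {request_id: chunk_length}).
    """
    decode_ids = decoding[:token_budget]          # decodes are scheduled first
    budget = token_budget - len(decode_ids)
    chunks = {}
    for rid in list(prefilling):
        if budget == 0:
            break
        take = min(budget, prefilling[rid])       # fill what is left of the budget
        chunks[rid] = take
        budget -= take
        prefilling[rid] -= take
        if prefilling[rid] == 0:
            del prefilling[rid]                   # prompt fully prefilled
    return decode_ids, chunks

decoding = ["a", "b", "c"]
prefilling = {"long-prompt": 1200}
chunk_sizes = []
while prefilling:
    decode_ids, chunks = schedule_step(decoding, prefilling)
    chunk_sizes.append(chunks["long-prompt"])
print(chunk_sizes)   # [509, 509, 182]
```

The 1,200-token prompt is spread over three forward passes, and the three decoding requests emit a token in every one of them instead of waiting for the whole prefill to finish.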

Why is fp8 the new standard for inference?

H100 has roughly 2× the fp8 TFLOPS of fp16, and fp8 weights take half the bytes, so each decode step reads half as much from HBM. Combined with per-tensor scaling factors, accuracy loss vs fp16 is usually under 1%. TensorRT-LLM, vLLM, and SGLang all support fp8 weights+activations natively now.

How do I tune for cost vs latency?

Throughput-optimized (cost): max batch size, large block sizes, allow more queueing. Latency-optimized (TTFT): smaller batches, prefix cache aggressively, speculative decoding on. Most production stacks let you pick a profile per replica.