LLM Quantization
int8, int4, fp8, BitNet — How to Halve, Quarter, or Octuple Your Model on the Same GPU
A 70B-parameter model in fp16 is 140 GB. That doesn't fit on an A100 (80 GB), much less an RTX 4090 (24 GB). Quantization is the family of techniques that map fp16 weights to fewer bits — int8, int4, fp8, even 1.58 bits — while keeping output quality nearly identical. The result: bigger models on smaller GPUs, faster inference (memory-bound, so fewer bits = more bandwidth), and lower power. The state of the art is 4-bit weight + fp16 activation; the frontier is fp8 weight + fp8 activation on H100s and 1.58-bit ternary in research.
The Quantization Map
From fp32 down to ternary. Each step buys memory and bandwidth at the cost of accuracy.
Key Numbers
1. The Basics: Symmetric vs Asymmetric, Per-Tensor vs Per-Channel
Quantization maps a real number x to an integer q via a scale (and optionally a zero point):
# Symmetric (zero point = 0)
scale = max(|w|) / 127
q = round(w / scale) # int8 in [-127, 127]
w_hat = q * scale # dequantized value
# Asymmetric (better for ReLU outputs that are all positive)
scale = (max(w) - min(w)) / 255
zp = round(-min(w) / scale) # zero point
q = round(w / scale + zp) # uint8 in [0, 255]
w_hat = (q - zp) * scale One scale per tensor is cheapest but loses range. Per-channel uses one scale per output row of a weight matrix — the standard. Per-group further subdivides each row into groups of 64 or 128, with a separate scale per group. AWQ and GPTQ both use group size 128.
2. LLM.int8() — bitsandbytes
Tim Dettmers' 2022 paper. The first practical int8 LLM scheme. Insight: outliers wreck quantization unless you handle them separately.
# LLM.int8 mixed-precision matmul:
# 1. Detect outlier columns of activations (> 6.0 absolute value)
# 2. Compute outlier columns in fp16 (small fraction)
# 3. Compute non-outlier in int8 (vast majority)
# 4. Sum results
mask_outlier = (|x|.max(dim=0) > 6.0)
y_int8 = x[:, ~mask_outlier].int8() @ W[~mask_outlier].int8()
y_fp16 = x[:, mask_outlier] @ W[mask_outlier]
y = y_int8.dequantize() + y_fp16 Why outliers matter: a few feature dimensions in transformers carry huge magnitudes (the classic 1.7% of features have ~10× higher activations). Standard int8 forced a giant scale factor across all features, destroying precision for the small ones. Splitting them out preserves the small features.
3. GPTQ, AWQ, ExLlama: Weight-Only int4
The mainstream local-LLM stack. Quantize weights to int4 offline; activations stay fp16.
GPTQ (Frantar et al. 2023) does layer-by-layer quantization, using second-order info (the Hessian) to pick which weights to quantize first and how to compensate for error in the remaining weights. Output: about 0.3% perplexity loss for int4 on Llama-2 70B.
# GPTQ pseudocode for one layer
H = compute_hessian(calibration_data) # input outer-product
for col in weight_cols(W):
q_col = quantize_int4(col)
err = (col - dequant(q_col))
# Update remaining columns to compensate for this error
W[:, col+1:] -= err * H[col, col+1:] / H[col, col] AWQ (Activation-aware Weight Quantization, Lin et al. 2023) takes a different angle: not all weights matter equally. The 1% of weights connected to the largest-magnitude activations dominate output quality. Scale those weights up before quantization (and scale activations down to compensate), so the precision goes where it counts.
ExLlama / ExLlamaV2 are inference engines focused on int4 throughput on consumer GPUs (RTX 3090/4090). They use a custom CUDA kernel that fuses dequantization with matmul and supports mixed bit widths per tensor (some layers at 6-bit, some at 4-bit) for accuracy/speed tuning.
4. fp8: e4m3 and e5m2 (Hopper / Blackwell)
NVIDIA's H100 introduced two 8-bit floating-point formats:
| Format | Exponent | Mantissa | Range | Use |
|---|---|---|---|---|
| e4m3 | 4 bits | 3 bits | ±448 (no inf) | Weights and activations (forward) |
| e5m2 | 5 bits | 2 bits | ±57344 | Gradients (backward, more range needed) |
fp8 vs int8: floating point preserves dynamic range. Integer 8-bit struggles with outliers in attention activations; fp8's exponent absorbs them naturally. Combined with per-tensor scaling (one scale factor per tensor, computed on-the-fly), fp8 inference matches fp16 to within 0.5% on most benchmarks — and runs 2× faster on H100.
Frameworks: TensorRT-LLM, vLLM (recent), Transformer Engine (NVIDIA's fp8 training library).
5. BitNet b1.58 — The 1.58-Bit Frontier
Microsoft Research, 2024. Replace every weight with a value in 1. log₂(3) ≈ 1.58 bits per weight.
# Quantization-aware training, not post-hoc
def bitnet_quantize(W):
scale = mean(|W|)
W_q = round_clip(W / scale, -1, +1) # ternary
return W_q, scale
# Forward pass: matmul becomes adds and subtractions
# (no multiplications for the weight matmul itself)
# Activations stay int8. The trick: BitNet is trained from scratch with quantization-aware training (QAT), not retrofitted onto a fp16 model. Surprising result: at 3B+ parameters, BitNet b1.58 matches fp16 LLM perplexity. Below 3B, it noticeably underperforms.
Hardware implication: ternary weights enable matmul-free inference (lookup tables and adders only). No real production silicon yet, but FPGA and custom ASIC interest is high.
6. PTQ vs QAT
| Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) | |
|---|---|---|
| When | After training, on a frozen model | During training (or finetuning) |
| Cost | Hours on a single GPU | Full training run |
| Quality | ~0.3–1% perplexity loss for int4 | Matches fp16 even at 1.58 bits (BitNet) |
| Examples | GPTQ, AWQ, LLM.int8 | BitNet, OmniQuant |
Practitioners almost always use PTQ. QAT is reserved for extreme bit widths (≤2) where PTQ collapses, or for end-to-end edge deployments where model size dictates everything.
Tradeoffs
| Choice | Win | Cost |
|---|---|---|
| Weight-only int4 (GPTQ/AWQ) | 4× memory; near-fp16 quality | Activations still fp16; less compute speedup |
| Weight + activation int8 | Faster compute on int8 silicon | Outlier handling is hard |
| fp8 (H100) | 2× compute, 2× memory; small accuracy loss | Hopper-only; needs Transformer Engine |
| 1.58-bit (QAT) | ~9× memory; matmul-free | Must retrain; no off-the-shelf hardware |
| Per-group scales (g=128) | Better accuracy than per-tensor | Extra metadata; slightly slower kernel |
FAQ
Why is int4 the sweet spot for local inference?
Memory bandwidth is the bottleneck on consumer GPUs. int4 halves the memory traffic of int8 with only ~0.5% extra perplexity. Below int4, the calibration data and group-size overhead grows, and quality drops faster.
Is GGUF a quantization scheme?
No — GGUF is a file format used by llama.cpp. Inside, it stores weights in one of many quantization schemes: Q4_0, Q4_K_M, Q5_K_S, etc. The "K-quants" use group sizes and mixed bit widths within a layer (more precision on important weights).
Why don't all GPUs support int4 matmul?
Tensor Cores added int4 support in Ampere (A100). Before that, int4 had to be unpacked to int8 inline, which negated half the throughput. fp8 is even more recent — only Hopper (H100) and later have native fp8 matmul.
Does quantization affect LoRA / PEFT?
Yes — QLoRA (Dettmers 2023) keeps base weights in 4-bit NF4 and trains LoRA adapters in fp16. Forward pass dequantizes weights on-the-fly. Reduces VRAM for finetuning a 65B model from 780 GB to 48 GB.
What's NF4?
NormalFloat 4-bit. A non-uniform 4-bit format whose levels are placed at the quantiles of a normal distribution — the empirical distribution of trained weights. About 0.3% better perplexity than uniform int4 at the same bit width. Used by QLoRA.
Do I need calibration data?
For activation-aware methods (AWQ, GPTQ) yes — usually 128–1024 samples representative of your task. For pure weight-only methods (LLM.int8 weight quant), no. For QAT, the entire training corpus.