LLM Quantization

int8, int4, fp8, BitNet — How to Halve, Quarter, or Octuple Your Model on the Same GPU

A 70B-parameter model in fp16 is 140 GB. That doesn't fit on an A100 (80 GB), much less an RTX 4090 (24 GB). Quantization is the family of techniques that map fp16 weights to fewer bits — int8, int4, fp8, even 1.58 bits — while keeping output quality nearly identical. The result: bigger models on smaller GPUs, faster inference (memory-bound, so fewer bits = more bandwidth), and lower power. The state of the art is 4-bit weight + fp16 activation; the frontier is fp8 weight + fp8 activation on H100s and 1.58-bit ternary in research.

The Quantization Map

From fp32 down to ternary. Each step buys memory and bandwidth at the cost of accuracy.

bits/weight 32 fp32 (full precision, training) 16 fp16 / bf16 (default LLM training and inference) 8 int8 / fp8 (LLM.int8, e4m3, e5m2) 4 int4 (GPTQ, AWQ, ExLlama, NF4) — most common 3 int3 (research, ExLlama variants) 2 int2 (significant quality loss) 1.58 BitNet b1.58 — ternary 1, 8.9× memory savings

Key Numbers

Memory savings: int4 vs fp16 (140 GB Llama-70B → 35 GB)
~1%
Typical perplexity increase from int4 GPTQ vs fp16
fp8 throughput on H100 vs fp16 (Tensor Core ops)
8.9×
BitNet b1.58 memory reduction vs fp16
128
Group size for AWQ/GPTQ — group of weights sharing one scale
e4m3
fp8 format used for weights and activations (4 exp, 3 mantissa)
512–1024
Calibration samples needed for GPTQ-style PTQ

1. The Basics: Symmetric vs Asymmetric, Per-Tensor vs Per-Channel

Quantization maps a real number x to an integer q via a scale (and optionally a zero point):

# Symmetric (zero point = 0)
scale = max(|w|) / 127
q     = round(w / scale)             # int8 in [-127, 127]
w_hat = q * scale                    # dequantized value

# Asymmetric (better for ReLU outputs that are all positive)
scale = (max(w) - min(w)) / 255
zp    = round(-min(w) / scale)       # zero point
q     = round(w / scale + zp)        # uint8 in [0, 255]
w_hat = (q - zp) * scale

One scale per tensor is cheapest but loses range. Per-channel uses one scale per output row of a weight matrix — the standard. Per-group further subdivides each row into groups of 64 or 128, with a separate scale per group. AWQ and GPTQ both use group size 128.

2. LLM.int8() — bitsandbytes

Tim Dettmers' 2022 paper. The first practical int8 LLM scheme. Insight: outliers wreck quantization unless you handle them separately.

# LLM.int8 mixed-precision matmul:
# 1. Detect outlier columns of activations (> 6.0 absolute value)
# 2. Compute outlier columns in fp16 (small fraction)
# 3. Compute non-outlier in int8 (vast majority)
# 4. Sum results

mask_outlier = (|x|.max(dim=0) > 6.0)
y_int8  = x[:, ~mask_outlier].int8() @ W[~mask_outlier].int8()
y_fp16  = x[:, mask_outlier]         @ W[mask_outlier]
y       = y_int8.dequantize() + y_fp16

Why outliers matter: a few feature dimensions in transformers carry huge magnitudes (the classic 1.7% of features have ~10× higher activations). Standard int8 forced a giant scale factor across all features, destroying precision for the small ones. Splitting them out preserves the small features.

3. GPTQ, AWQ, ExLlama: Weight-Only int4

The mainstream local-LLM stack. Quantize weights to int4 offline; activations stay fp16.

GPTQ (Frantar et al. 2023) does layer-by-layer quantization, using second-order info (the Hessian) to pick which weights to quantize first and how to compensate for error in the remaining weights. Output: about 0.3% perplexity loss for int4 on Llama-2 70B.

# GPTQ pseudocode for one layer
H = compute_hessian(calibration_data)           # input outer-product
for col in weight_cols(W):
    q_col = quantize_int4(col)
    err   = (col - dequant(q_col))
    # Update remaining columns to compensate for this error
    W[:, col+1:] -= err * H[col, col+1:] / H[col, col]

AWQ (Activation-aware Weight Quantization, Lin et al. 2023) takes a different angle: not all weights matter equally. The 1% of weights connected to the largest-magnitude activations dominate output quality. Scale those weights up before quantization (and scale activations down to compensate), so the precision goes where it counts.

ExLlama / ExLlamaV2 are inference engines focused on int4 throughput on consumer GPUs (RTX 3090/4090). They use a custom CUDA kernel that fuses dequantization with matmul and supports mixed bit widths per tensor (some layers at 6-bit, some at 4-bit) for accuracy/speed tuning.

4. fp8: e4m3 and e5m2 (Hopper / Blackwell)

NVIDIA's H100 introduced two 8-bit floating-point formats:

FormatExponentMantissaRangeUse
e4m34 bits3 bits±448 (no inf)Weights and activations (forward)
e5m25 bits2 bits±57344Gradients (backward, more range needed)

fp8 vs int8: floating point preserves dynamic range. Integer 8-bit struggles with outliers in attention activations; fp8's exponent absorbs them naturally. Combined with per-tensor scaling (one scale factor per tensor, computed on-the-fly), fp8 inference matches fp16 to within 0.5% on most benchmarks — and runs 2× faster on H100.

Frameworks: TensorRT-LLM, vLLM (recent), Transformer Engine (NVIDIA's fp8 training library).

5. BitNet b1.58 — The 1.58-Bit Frontier

Microsoft Research, 2024. Replace every weight with a value in 1. log₂(3) ≈ 1.58 bits per weight.

# Quantization-aware training, not post-hoc
def bitnet_quantize(W):
    scale = mean(|W|)
    W_q   = round_clip(W / scale, -1, +1)        # ternary
    return W_q, scale

# Forward pass: matmul becomes adds and subtractions
# (no multiplications for the weight matmul itself)
# Activations stay int8.

The trick: BitNet is trained from scratch with quantization-aware training (QAT), not retrofitted onto a fp16 model. Surprising result: at 3B+ parameters, BitNet b1.58 matches fp16 LLM perplexity. Below 3B, it noticeably underperforms.

Hardware implication: ternary weights enable matmul-free inference (lookup tables and adders only). No real production silicon yet, but FPGA and custom ASIC interest is high.

6. PTQ vs QAT

Post-Training Quantization (PTQ)Quantization-Aware Training (QAT)
WhenAfter training, on a frozen modelDuring training (or finetuning)
CostHours on a single GPUFull training run
Quality~0.3–1% perplexity loss for int4Matches fp16 even at 1.58 bits (BitNet)
ExamplesGPTQ, AWQ, LLM.int8BitNet, OmniQuant

Practitioners almost always use PTQ. QAT is reserved for extreme bit widths (≤2) where PTQ collapses, or for end-to-end edge deployments where model size dictates everything.

Tradeoffs

ChoiceWinCost
Weight-only int4 (GPTQ/AWQ)4× memory; near-fp16 qualityActivations still fp16; less compute speedup
Weight + activation int8Faster compute on int8 siliconOutlier handling is hard
fp8 (H100)2× compute, 2× memory; small accuracy lossHopper-only; needs Transformer Engine
1.58-bit (QAT)~9× memory; matmul-freeMust retrain; no off-the-shelf hardware
Per-group scales (g=128)Better accuracy than per-tensorExtra metadata; slightly slower kernel

FAQ

Why is int4 the sweet spot for local inference?

Memory bandwidth is the bottleneck on consumer GPUs. int4 halves the memory traffic of int8 with only ~0.5% extra perplexity. Below int4, the calibration data and group-size overhead grows, and quality drops faster.

Is GGUF a quantization scheme?

No — GGUF is a file format used by llama.cpp. Inside, it stores weights in one of many quantization schemes: Q4_0, Q4_K_M, Q5_K_S, etc. The "K-quants" use group sizes and mixed bit widths within a layer (more precision on important weights).

Why don't all GPUs support int4 matmul?

Tensor Cores added int4 support in Ampere (A100). Before that, int4 had to be unpacked to int8 inline, which negated half the throughput. fp8 is even more recent — only Hopper (H100) and later have native fp8 matmul.

Does quantization affect LoRA / PEFT?

Yes — QLoRA (Dettmers 2023) keeps base weights in 4-bit NF4 and trains LoRA adapters in fp16. Forward pass dequantizes weights on-the-fly. Reduces VRAM for finetuning a 65B model from 780 GB to 48 GB.

What's NF4?

NormalFloat 4-bit. A non-uniform 4-bit format whose levels are placed at the quantiles of a normal distribution — the empirical distribution of trained weights. About 0.3% better perplexity than uniform int4 at the same bit width. Used by QLoRA.

Do I need calibration data?

For activation-aware methods (AWQ, GPTQ) yes — usually 128–1024 samples representative of your task. For pure weight-only methods (LLM.int8 weight quant), no. For QAT, the entire training corpus.