LLM Quantization

int8, int4, fp8, BitNet — How to Halve, Quarter, or Octuple Your Model on the Same GPU

A 70B-parameter model in fp16 is 140 GB. That doesn't fit on an A100 (80 GB), much less an RTX 4090 (24 GB). Quantization is the family of techniques that map fp16 weights to fewer bits — int8, int4, fp8, even 1.58 bits — while keeping output quality nearly identical. The result: bigger models on smaller GPUs, faster inference (memory-bound, so fewer bits = more bandwidth), and lower power. The state of the art is 4-bit weight + fp16 activation; the frontier is fp8 weight + fp8 activation on H100s and 1.58-bit ternary in research.

The Quantization Map

From fp32 down to ternary. Each step buys memory and bandwidth at the cost of accuracy.

Key Numbers

4×

Memory savings: int4 vs fp16 (140 GB Llama-70B → 35 GB)

~1%

Typical perplexity increase from int4 GPTQ vs fp16

2×

fp8 throughput on H100 vs fp16 (Tensor Core ops)

8.9×

BitNet b1.58 memory reduction vs fp16

128

Group size for AWQ/GPTQ — group of weights sharing one scale

e4m3

fp8 format used for weights and activations (4 exp, 3 mantissa)

512–1024

Calibration samples needed for GPTQ-style PTQ

1. The Basics: Symmetric vs Asymmetric, Per-Tensor vs Per-Channel

Quantization maps a real number x to an integer q via a scale (and optionally a zero point):

# Symmetric (zero point = 0)
scale = max(|w|) / 127
q     = round(w / scale)             # int8 in [-127, 127]
w_hat = q * scale                    # dequantized value

# Asymmetric (better for ReLU outputs that are all positive)
scale = (max(w) - min(w)) / 255
zp    = round(-min(w) / scale)       # zero point
q     = round(w / scale + zp)        # uint8 in [0, 255]
w_hat = (q - zp) * scale

One scale per tensor is cheapest but loses range. Per-channel uses one scale per output row of a weight matrix — the standard. Per-group further subdivides each row into groups of 64 or 128, with a separate scale per group. AWQ and GPTQ both use group size 128.

2. LLM.int8() — bitsandbytes

Tim Dettmers' 2022 paper. The first practical int8 LLM scheme. Insight: outliers wreck quantization unless you handle them separately.

# LLM.int8 mixed-precision matmul:
# 1. Detect outlier columns of activations (> 6.0 absolute value)
# 2. Compute outlier columns in fp16 (small fraction)
# 3. Compute non-outlier in int8 (vast majority)
# 4. Sum results

mask_outlier = (|x|.max(dim=0) > 6.0)
y_int8  = x[:, ~mask_outlier].int8() @ W[~mask_outlier].int8()
y_fp16  = x[:, mask_outlier]         @ W[mask_outlier]
y       = y_int8.dequantize() + y_fp16

Why outliers matter: a few feature dimensions in transformers carry huge magnitudes (the classic 1.7% of features have ~10× higher activations). Standard int8 forced a giant scale factor across all features, destroying precision for the small ones. Splitting them out preserves the small features.

3. GPTQ, AWQ, ExLlama: Weight-Only int4

The mainstream local-LLM stack. Quantize weights to int4 offline; activations stay fp16.

GPTQ (Frantar et al. 2023) does layer-by-layer quantization, using second-order info (the Hessian) to pick which weights to quantize first and how to compensate for error in the remaining weights. Output: about 0.3% perplexity loss for int4 on Llama-2 70B.

# GPTQ pseudocode for one layer
H = compute_hessian(calibration_data)           # input outer-product
for col in weight_cols(W):
    q_col = quantize_int4(col)
    err   = (col - dequant(q_col))
    # Update remaining columns to compensate for this error
    W[:, col+1:] -= err * H[col, col+1:] / H[col, col]

AWQ (Activation-aware Weight Quantization, Lin et al. 2023) takes a different angle: not all weights matter equally. The 1% of weights connected to the largest-magnitude activations dominate output quality. Scale those weights up before quantization (and scale activations down to compensate), so the precision goes where it counts.

ExLlama / ExLlamaV2 are inference engines focused on int4 throughput on consumer GPUs (RTX 3090/4090). They use a custom CUDA kernel that fuses dequantization with matmul and supports mixed bit widths per tensor (some layers at 6-bit, some at 4-bit) for accuracy/speed tuning.

4. fp8: e4m3 and e5m2 (Hopper / Blackwell)

NVIDIA's H100 introduced two 8-bit floating-point formats:

Format	Exponent	Mantissa	Range	Use
e4m3	4 bits	3 bits	±448 (no inf)	Weights and activations (forward)
e5m2	5 bits	2 bits	±57344	Gradients (backward, more range needed)

fp8 vs int8: floating point preserves dynamic range. Integer 8-bit struggles with outliers in attention activations; fp8's exponent absorbs them naturally. Combined with per-tensor scaling (one scale factor per tensor, computed on-the-fly), fp8 inference matches fp16 to within 0.5% on most benchmarks — and runs 2× faster on H100.

Frameworks: TensorRT-LLM, vLLM (recent), Transformer Engine (NVIDIA's fp8 training library).

5. BitNet b1.58 — The 1.58-Bit Frontier

Microsoft Research, 2024. Replace every weight with a value in 1. log₂(3) ≈ 1.58 bits per weight.

# Quantization-aware training, not post-hoc
def bitnet_quantize(W):
    scale = mean(|W|)
    W_q   = round_clip(W / scale, -1, +1)        # ternary
    return W_q, scale

# Forward pass: matmul becomes adds and subtractions
# (no multiplications for the weight matmul itself)
# Activations stay int8.

The trick: BitNet is trained from scratch with quantization-aware training (QAT), not retrofitted onto a fp16 model. Surprising result: at 3B+ parameters, BitNet b1.58 matches fp16 LLM perplexity. Below 3B, it noticeably underperforms.

Hardware implication: ternary weights enable matmul-free inference (lookup tables and adders only). No real production silicon yet, but FPGA and custom ASIC interest is high.

6. PTQ vs QAT

	Post-Training Quantization (PTQ)	Quantization-Aware Training (QAT)
When	After training, on a frozen model	During training (or finetuning)
Cost	Hours on a single GPU	Full training run
Quality	~0.3–1% perplexity loss for int4	Matches fp16 even at 1.58 bits (BitNet)
Examples	GPTQ, AWQ, LLM.int8	BitNet, OmniQuant

Practitioners almost always use PTQ. QAT is reserved for extreme bit widths (≤2) where PTQ collapses, or for end-to-end edge deployments where model size dictates everything.

Tradeoffs

Choice	Win	Cost
Weight-only int4 (GPTQ/AWQ)	4× memory; near-fp16 quality	Activations still fp16; less compute speedup
Weight + activation int8	Faster compute on int8 silicon	Outlier handling is hard
fp8 (H100)	2× compute, 2× memory; small accuracy loss	Hopper-only; needs Transformer Engine
1.58-bit (QAT)	~9× memory; matmul-free	Must retrain; no off-the-shelf hardware
Per-group scales (g=128)	Better accuracy than per-tensor	Extra metadata; slightly slower kernel

FAQ

Why is int4 the sweet spot for local inference?

Memory bandwidth is the bottleneck on consumer GPUs. int4 halves the memory traffic of int8 with only ~0.5% extra perplexity. Below int4, the calibration data and group-size overhead grows, and quality drops faster.

Is GGUF a quantization scheme?

No — GGUF is a file format used by llama.cpp. Inside, it stores weights in one of many quantization schemes: Q4_0, Q4_K_M, Q5_K_S, etc. The "K-quants" use group sizes and mixed bit widths within a layer (more precision on important weights).

Why don't all GPUs support int4 matmul?

Tensor Cores added int4 support in Ampere (A100). Before that, int4 had to be unpacked to int8 inline, which negated half the throughput. fp8 is even more recent — only Hopper (H100) and later have native fp8 matmul.

Does quantization affect LoRA / PEFT?

Yes — QLoRA (Dettmers 2023) keeps base weights in 4-bit NF4 and trains LoRA adapters in fp16. Forward pass dequantizes weights on-the-fly. Reduces VRAM for finetuning a 65B model from 780 GB to 48 GB.

What's NF4?

NormalFloat 4-bit. A non-uniform 4-bit format whose levels are placed at the quantiles of a normal distribution — the empirical distribution of trained weights. About 0.3% better perplexity than uniform int4 at the same bit width. Used by QLoRA.

Do I need calibration data?

For activation-aware methods (AWQ, GPTQ) yes — usually 128–1024 samples representative of your task. For pure weight-only methods (LLM.int8 weight quant), no. For QAT, the entire training corpus.