TurboQuant
Fast Post-Training Quantization — Calibration, Group Sizes, GGUF / AWQ / GPTQ Exports
A "quantization toolkit" packages the messy parts of getting a Llama-class model from fp16 weights to int4 or fp8 deployable artifacts. The pipeline is the same across every modern PTQ stack: pick a method (GPTQ / AWQ / SmoothQuant), pass calibration data through the model to gather statistics, decide on per-channel vs per-group scaling, then export to the format your inference engine expects — GGUF for llama.cpp, AWQ for vLLM, GPTQ for ExLlama, MLX for Apple Silicon. TurboQuant-style tools automate this so a 70B model goes from fp16 to deployable int4 in under an hour.
The Quantization Pipeline
Five stages: load, calibrate, quantize, validate, export.
1. Calibration Data
A representative sample of inputs that the quantizer will use to compute scaling statistics.
For PTQ to work, the quantizer needs to see what activations look like in practice. Standard practice is to feed 256–2048 sequences from a corpus that resembles the deployment distribution:
    # Typical calibration source: C4, WikiText, or a domain corpus
    from datasets import load_dataset

    # C4 needs a config name; "en" is the English subset
    calib = load_dataset("allenai/c4", "en", split="train", streaming=True)

    # `tokenizer` is the tokenizer of the model being quantized, loaded beforehand
    samples = []
    for ex in calib.take(512):
        tokens = tokenizer(ex["text"], truncation=True, max_length=2048)
        samples.append(tokens["input_ids"])

    # Pass these through the model to gather:
    # - per-tensor activation max/min
    # - per-channel activation distributions
    # - second-order info (Hessian) for GPTQ-style methods

Domain mismatch is the most common quantization bug: calibrating on Wikipedia and deploying on code produces noticeable quality loss. Best practice: calibrate on a sample of the actual deployment traffic when possible.
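The per-tensor and per-channel statistics listed above can be accumulated one calibration batch at a time, without keeping all activations in memory. The following is a minimal pure-Python sketch of that idea; `RunningActStats` is a hypothetical helper name, and a real toolkit would do the same thing on tensors inside forward hooks.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class RunningActStats:
    """Streaming activation statistics for one layer input (hypothetical helper)."""
    num_channels: int
    tensor_min: float = float("inf")
    tensor_max: float = float("-inf")
    channel_absmax: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.channel_absmax:
            self.channel_absmax = [0.0] * self.num_channels

    def update(self, batch: Sequence[Sequence[float]]) -> None:
        """batch: rows are tokens, columns are channels."""
        for row in batch:
            if len(row) != self.num_channels:
                raise ValueError("row width does not match num_channels")
            for c, value in enumerate(row):
                self.tensor_min = min(self.tensor_min, value)
                self.tensor_max = max(self.tensor_max, value)
                self.channel_absmax[c] = max(self.channel_absmax[c], abs(value))


stats = RunningActStats(num_channels=3)
stats.update([[0.1, -2.0, 0.5], [0.3, 1.5, -0.2]])  # first calibration batch
stats.update([[-0.4, 0.7, 4.0]])                     # second calibration batch
print(stats.tensor_min, stats.tensor_max, stats.channel_absmax)
# -2.0 4.0 [0.4, 2.0, 4.0]
```

A channel whose absolute maximum is far larger than its neighbours (the third one here) is the kind of outlier that a single per-tensor scale handles poorly.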
2. Per-Channel vs Per-Group Quantization
Each weight matrix has a 2D shape (out_features, in_features). The scale factor can be applied at different granularities:
| Granularity | Scales per matrix | Accuracy | Overhead |
|---|---|---|---|
| Per-tensor | 1 | Worst | ~0% |
| Per-channel (per-row) | out_features | Better | ~0.5% |
| Per-group, g=128 | out × (in/128) | Better still (common default) | ~3-5% |
| Per-group, g=64 | out × (in/64) | Marginally better than g=128 | ~6-10% |
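The overhead is the metadata: extra bytes for scales and (for asymmetric schemes) zero points stored alongside the int4 weights. For example, one fp16 scale per 128 int4 weights adds 16 bits to 512 bits of payload, about 3%. AWQ and GPTQ both commonly use g=128, and their usual checkpoints store a scale and a zero-point per group.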
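To make the granularity trade-off concrete, here is a small pure-Python sketch of asymmetric per-group int4 quantization together with the metadata arithmetic. The function names, the toy weight row, and the 4-bit zero-point width are assumptions chosen for illustration, not the storage layout of any particular format.

```python
from typing import List, Tuple


def quantize_groups_int4(row: List[float], group_size: int) -> Tuple[List[int], List[float], List[int]]:
    """Asymmetric 4-bit quantization with one (scale, zero_point) per group."""
    codes, scales, zeros = [], [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo = min(min(group), 0.0)            # keep 0.0 inside the representable range
        hi = max(max(group), 0.0)
        scale = (hi - lo) / 15 or 1.0        # 15 steps between the 16 int4 codes
        zero = round(-lo / scale)            # integer code that dequantizes to 0.0
        scales.append(scale)
        zeros.append(zero)
        codes.extend(max(0, min(15, round(w / scale) + zero)) for w in group)
    return codes, scales, zeros


def dequantize_groups(codes: List[int], scales: List[float], zeros: List[int], group_size: int) -> List[float]:
    return [(c - zeros[i // group_size]) * scales[i // group_size] for i, c in enumerate(codes)]


def mean_abs_error(row: List[float], group_size: int) -> float:
    codes, scales, zeros = quantize_groups_int4(row, group_size)
    restored = dequantize_groups(codes, scales, zeros, group_size)
    return sum(abs(a - b) for a, b in zip(row, restored)) / len(row)


def metadata_overhead(group_size: int, weight_bits: int = 4, scale_bits: int = 16, zero_bits: int = 0) -> float:
    """Scale and zero-point bits as a fraction of the packed weight bits."""
    return (scale_bits + zero_bits) / (group_size * weight_bits)


# Toy row with one outlier (made-up values): smaller groups confine its damage.
row = [0.1, -0.2, 0.05, 0.15, 8.0, -0.1, 0.2, -0.05]
print("one group of 8 :", mean_abs_error(row, 8))
print("two groups of 4:", mean_abs_error(row, 4))

print(metadata_overhead(128))               # 0.03125  (fp16 scale only)
print(metadata_overhead(64, zero_bits=4))   # 0.078125 (fp16 scale + 4-bit zero-point)
```

The two overhead values fall inside the ranges quoted in the table; the exact figure for a real format depends on how it packs scales and zero-points.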
3. Layer-By-Layer Quantization (GPTQ Style)
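Naive round-to-nearest (RTN) quantizes each weight in isolation. GPTQ also works one layer at a time, but within a layer it quantizes the columns in sequence and adjusts the not-yet-quantized columns to compensate for the error just introduced:

    import torch

    def gptq_layer(W, X, damp=0.01):
        # W: (out_features, in_features) weights of one linear layer
        # X: (n_tokens, in_features) inputs to this layer, captured from a
        #    forward pass over the calibration set
        W = W.clone().float()
        # 1. Hessian of the layer-wise squared error (input second moment),
        #    damped so the factorization below is stable
        H = X.T.float() @ X.float()
        H += damp * H.diagonal().mean() * torch.eye(H.shape[0], device=H.device)
        # 2. Upper Cholesky factor of H^-1; row j holds the update weights for column j
        Hinv = torch.linalg.cholesky(
            torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)
        # 3. Quantize column-by-column (quantize_int4 / dequantize are placeholder helpers)
        for j in range(W.shape[1]):
            w_q = dequantize(quantize_int4(W[:, j]))
            err = (W[:, j] - w_q) / Hinv[j, j]
            # Spread the error to the remaining columns
            W[:, j+1:] -= err.unsqueeze(1) * Hinv[j, j+1:].unsqueeze(0)
            W[:, j] = w_q
        return W

This is a faster approximation of OBQ (Optimal Brain Quantization): every row is quantized in the same column order, and the inverse-Hessian information is precomputed once via Cholesky factorization instead of being updated after every weight. Accumulating H = X.T @ X is a plain matrix multiply, so it parallelizes well on a GPU.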
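To see what the error compensation buys on a toy problem, the NumPy sketch below compares plain round-to-nearest with a GPTQ-style update on random data. The symmetric per-row int4 grid, the damping value, and all names are assumptions made for this illustration; it is not a reproduction of any toolkit's implementation.

```python
import numpy as np


def quant_grid(w, scale):
    """Symmetric int4 round-to-nearest onto the grid scale * [-8, 7]."""
    return np.clip(np.round(w / scale), -8, 7) * scale


def gptq_toy(W, X, damp=0.01):
    """GPTQ-style column-by-column quantization of W (out, in) given inputs X (n, in)."""
    W = W.astype(np.float64).copy()
    scale = np.abs(W).max(axis=1) / 7.0        # one scale per output row, fixed up front
    H = X.T @ X                                # input second moment
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # damping for conditioning
    # Upper Cholesky factor U of H^-1, so that H^-1 = U.T @ U
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quant_grid(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / U[j, j]
        # Push the quantization error of column j onto the remaining columns
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q, scale


rng = np.random.default_rng(0)
# Correlated input channels are where error compensation matters most
X = rng.normal(size=(256, 16)) @ rng.normal(size=(16, 16))
W = rng.normal(size=(8, 16))

Q_gptq, scale = gptq_toy(W, X)
Q_rtn = quant_grid(W, scale[:, None])


def output_error(Q):
    """Squared error of the layer output, the quantity GPTQ tries to keep small."""
    return float(np.linalg.norm(X @ (W - Q).T) ** 2)


print("RTN  output error:", output_error(Q_rtn))
print("GPTQ output error:", output_error(Q_gptq))
```

Both results use the same grid and the same scales, so any difference in output error comes from the column-by-column compensation. On correlated inputs like these the GPTQ-style result is typically lower, though individual weights may sit further from their original values.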
4. Export Formats
| Format | Engine | Hardware | Bit widths |
|---|---|---|---|
| GGUF | llama.cpp / Ollama | CPU + Metal + CUDA + ROCm | 2, 3, 4, 5, 6, 8 |
| AWQ (safetensors) | vLLM, AutoAWQ, TGI | NVIDIA | 4 (group 128) |
| GPTQ (safetensors) | ExLlama, vLLM, TGI | NVIDIA | 2, 3, 4, 8 |
| MLX | MLX framework | Apple Silicon (M1/M2/M3) | 4, 6, 8 |
| ONNX-INT4 | ONNX Runtime | x86 CPU, ARM, NPUs | 4, 8 |
| TensorRT-LLM checkpoint | TensorRT-LLM | NVIDIA H100/H200/B100 | 4, 8, fp8 |
GGUF is the lingua franca for desktop/edge deployments — single-file, mmap-friendly, supports many bit widths within one file. AWQ and GPTQ are the GPU-server formats. MLX is Apple-only but increasingly popular for laptops. Most quantization toolkits support multiple exports from the same calibrated state.
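A cheap sanity check after a GGUF export is to read the file header before handing the file to an inference engine. The sketch below assumes the GGUF v2/v3 layout (4-byte magic, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key-value count); `read_gguf_header` is a hypothetical helper, and the demo parses a synthetic header with made-up counts, not a real model file.

```python
import io
import struct
from typing import BinaryIO, Dict


def read_gguf_header(f: BinaryIO) -> Dict[str, int]:
    """Parse the fixed-size GGUF header (layout assumed: v2/v3, little-endian)."""
    magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file, magic={magic!r}")
    version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration only
fake = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
print(read_gguf_header(fake))
# {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 24}
```

For a real file, pass `open(path, "rb")` instead of the `BytesIO` object; a tensor count of zero or a wrong magic usually means the export was interrupted or the file is in a different format.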
5. Hardware Speedup Comparison
| Hardware | fp16 baseline | int4 speedup | fp8 speedup | Notes |
|---|---|---|---|---|
| RTX 4090 (24 GB) | ~30 tok/s | ~110 tok/s | n/a | Memory-bound; Ada supports fp8 but no figure is listed; 70B int4 (~35 GB) still exceeds 24 GB |
| A100 (80 GB) | ~40 tok/s | ~150 tok/s | n/a | Ampere has no fp8 tensor cores; weight-only int4 kernels dequantize to fp16 for the matmul |
| H100 (80 GB) | ~70 tok/s | ~250 tok/s | ~140 tok/s (per-tensor) | fp8 weights are twice the size of int4, but fp8 matmuls run natively on Hopper tensor cores |
| M3 Max (128 GB) | ~10 tok/s | ~30 tok/s | n/a | Unified memory wins on big models |
| MI300X (192 GB) | ~50 tok/s | ~180 tok/s | ~120 tok/s | ROCm support uneven |
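Throughput numbers are rough, illustrative single-stream decode figures for Llama-3 70B at batch=1, not measurements to plan capacity around. Note that 70B parameters take about 140 GB in fp16 and about 35 GB in int4 before the KV cache, so the fp16 baseline does not fit on a single 24 GB or 80 GB device without multi-GPU sharding or offload. Bigger batches further amortize the weight load and tilt toward compute-bound, where fp16 closes the gap.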
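The reason int4 helps at batch=1 is that decode is limited by how fast the weights can be streamed from memory. A back-of-the-envelope bound follows from dividing memory bandwidth by the weight footprint. The sketch below assumes every generated token reads all weights exactly once and ignores the KV cache, quantization metadata, and kernel efficiency; the 1000 GB/s bandwidth is a hypothetical round number, not a spec for any device in the table.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights in bytes at a given precision."""
    return n_params * bits_per_weight / 8


def decode_tokens_per_s_bound(n_params: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper bound on batch=1 decode speed when each token streams all weights once."""
    return bandwidth_gb_s * 1e9 / weight_bytes(n_params, bits_per_weight)


N_PARAMS = 70e9          # Llama-3 70B class model
BANDWIDTH_GB_S = 1000.0  # hypothetical device bandwidth

for name, bits in [("fp16", 16), ("fp8", 8), ("int4", 4)]:
    size_gb = weight_bytes(N_PARAMS, bits) / 1e9
    bound = decode_tokens_per_s_bound(N_PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{name}: {size_gb:.0f} GB of weights, at most {bound:.1f} tok/s")
# fp16: 140 GB of weights, at most 7.1 tok/s
# fp8: 70 GB of weights, at most 14.3 tok/s
# int4: 35 GB of weights, at most 28.6 tok/s
```

Under this model the bound scales inversely with bits per weight, which is why a 4x smaller weight footprint can approach a 4x decode speedup when the workload is memory-bound.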
6. Validation: How to Tell If You Wrecked the Model
- Perplexity on WikiText-2: classic, fast, but only measures language modeling. Aim for <1% increase vs fp16.
- MMLU, ARC, HellaSwag: knowledge / reasoning benchmarks. Should be within 1-2 points.
- Domain eval: if you'll deploy on code, run HumanEval. If chat, run MT-Bench.
- Output-distribution comparison: KL divergence between fp16 and quantized model logits on a held-out set.
- Long-context degradation: quantization often hurts more at >8k tokens. Test specifically.
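The output-distribution check in the list above can be sketched in a few lines. This pure-Python version computes KL(fp16 || quantized) per token position from raw logits and averages it; the function names and the logit values are made up for illustration, and a real run would take the logits from both models on the same held-out tokens.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits: Sequence[float], test_logits: Sequence[float]) -> float:
    """KL(P_ref || P_test) in nats for one token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(ref_batch: Sequence[Sequence[float]], test_batch: Sequence[Sequence[float]]) -> float:
    """Average KL over token positions."""
    pairs = list(zip(ref_batch, test_batch))
    return sum(kl_divergence(r, t) for r, t in pairs) / len(pairs)


# Made-up logits for two token positions over a 3-word vocabulary
fp16_logits = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
int4_logits = [[1.9, 1.1, 0.0], [0.4, 0.7, 2.8]]
print("mean KL (nats):", mean_kl(fp16_logits, int4_logits))
```

Unlike perplexity, this metric needs no labels and reacts to any shift in the predicted distribution, including shifts that leave the top-1 token unchanged.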
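    # Quick PPL eval with lm-evaluation-harness (v0.4-style API; metric key names can differ by version)
    import lm_eval

    def wikitext_ppl(model_path):
        out = lm_eval.simple_evaluate(
            model="hf", model_args=f"pretrained={model_path}", tasks=["wikitext"])
        return out["results"]["wikitext"]["word_perplexity,none"]

    # fp16_path / int4_path: placeholders for the two checkpoints being compared
    ppl_fp16 = wikitext_ppl(fp16_path)
    ppl_int4 = wikitext_ppl(int4_path)
    delta = (ppl_int4 - ppl_fp16) / ppl_fp16
    assert delta < 0.01, f"Quantization regression: +{delta*100:.1f}%"

Tradeoffs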
| Decision | Pro | Con |
|---|---|---|
| GPTQ vs AWQ | GPTQ slightly better PPL | AWQ much faster, simpler kernel |
| g=64 vs g=128 | g=64 better quality | 2× metadata overhead |
| int4 weight, fp16 act | Quality preserved | Compute still bottlenecked by fp16 |
| fp8 weight + act | 2× compute on H100 | Needs fp8-capable hardware (Hopper, Ada, or newer); tighter accuracy margins |
| GGUF over native | Universal (CPU + GPU) | Slightly slower than tuned engines |
FAQ
Which method should I pick for production?
For NVIDIA GPU servers: AWQ int4 in vLLM or TensorRT-LLM fp8. For Mac laptops: MLX 4-bit. For CPU-only or mixed: GGUF Q4_K_M. AWQ is the safest default — fast to produce, fast at runtime, near-fp16 quality.
Why does Q4_K_M outperform Q4_0 in llama.cpp?
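K-quants change two things. First, they use a super-block layout: 256 weights are split into sub-blocks, each with its own quantized scale and minimum, which tracks the weight distribution more closely than Q4_0's single scale per 32 weights. Second, the mixes assign different quant types per tensor: in Q4_K_M most tensors are Q4_K, while some sensitive ones (such as part of the attention value and FFN down projections) are kept at Q6_K. Average bit width is similar but quality is meaningfully higher. The "_M" suffix is the "medium" mix; "_S" is smaller, "_L" is larger.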
Can I quantize and finetune simultaneously?
Yes — that's QLoRA (Dettmers 2023). Base weights stay frozen in 4-bit NF4; LoRA adapters in fp16 are trained on top. The base weights dequantize on-the-fly during forward.
How do I know when to retrain (QAT) instead of PTQ?
If PTQ at your target bit width loses >2% on benchmarks, you need QAT. For ≥4 bits, PTQ is almost always sufficient. For ≤2 bits, QAT is mandatory.
Does the quantization toolkit need a GPU?
For 7B–13B, you can quantize on CPU but it takes hours. For 70B+, you need at least one GPU with enough VRAM to hold one layer at a time (~5 GB). Most pipelines stream layers off-GPU as soon as they're done to keep VRAM low.
How does this differ from training-time quantization?
PTQ is purely post-hoc; you start with a finished fp16 model. Quantization-aware training (QAT) bakes the quantization noise into the training loop and is used for extreme bit widths like BitNet's 1.58 bits. PTQ is orders of magnitude cheaper because it needs no training run, but its quality typically degrades sharply below about 3 bits.