TurboQuant

Fast Post-Training Quantization — Calibration, Group Sizes, GGUF / AWQ / GPTQ Exports

A "quantization toolkit" packages the messy parts of getting a Llama-class model from fp16 weights to deployable int4 or fp8 artifacts. The pipeline is the same across modern PTQ stacks: pick a method (GPTQ / AWQ / SmoothQuant), pass calibration data through the model to gather statistics, decide on per-channel vs per-group scaling, then export to the format your inference engine expects (GGUF for llama.cpp, AWQ for vLLM, GPTQ for ExLlama, MLX for Apple Silicon). TurboQuant-style tools automate this so a 70B model can go from fp16 to deployable int4 in roughly 30–90 minutes on a single H100.

The Quantization Pipeline

Five stages: load, calibrate, quantize, validate, export.

  • Load: fp16 safetensors
  • Calibrate: ~512 samples
  • Quantize: layer-by-layer
  • Validate: PPL / MMLU
  • Export: GGUF · AWQ · GPTQ · MLX · CoreML · ONNX

Total time on a single H100: 30–90 min for 70B.

Key Numbers

  • ~30 min: GPTQ time on a 70B model, single H100
  • 512-2048: calibration samples · diminishing returns above 1024
  • 128: default group size (AWQ, GPTQ)
  • 35 GB: Llama-3 70B int4 GPTQ artifact size
  • ~0.3: perplexity points lost vs fp16 (Llama-2 70B int4)
  • 3-5×: inference speedup vs fp16 on consumer GPUs (memory-bound)
  • 8-12 GB: VRAM needed for the quantization process itself

1. Calibration Data

A representative sample of inputs that the quantizer will use to compute scaling statistics.

For PTQ to work, the quantizer needs to see what activations look like in practice. Standard practice is to feed 256–2048 sequences from a corpus that resembles the deployment distribution:

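# Typical calibration source: C4, WikiText, or a domain corpus
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: substitute the checkpoint you are actually quantizing
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")

# C4 needs a config name ("en"); streaming avoids downloading the full corpus
calib = load_dataset("allenai/c4", "en", split="train", streaming=True)
samples = []
for ex in calib.take(512):
    tokens = tokenizer(ex["text"], truncation=True, max_length=2048)
    samples.append(tokens["input_ids"])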

# Pass these through the model to gather:
#  - per-tensor activation max/min
#  - per-channel activation distributions
#  - second-order info (Hessian) for GPTQ-style methods

Domain mismatch is a common cause of avoidable quality loss: calibrating on Wikipedia and deploying on code can noticeably degrade output. Where possible, calibrate on a sample of the actual deployment traffic.
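The statistics named in the comments above can be accumulated in one streaming pass over the calibration batches. The following is a minimal sketch in plain Python, assuming activations arrive as lists of rows (one row per token, one column per channel). The helper name `update_channel_absmax` and the toy values are illustrative only and do not belong to any toolkit's API.

```python
def update_channel_absmax(running, batch):
    """Fold one batch of activations into the running per-channel abs-max."""
    for row in batch:                      # one row per token
        for c, value in enumerate(row):    # one column per channel
            running[c] = max(running[c], abs(value))
    return running


# Two toy calibration batches with 3 channels; channel 2 carries an outlier.
batches = [
    [[0.1, -0.4, 2.0], [0.3, 0.2, -9.5]],
    [[-0.6, 0.1, 1.2]],
]

absmax = [0.0, 0.0, 0.0]
for batch in batches:
    update_channel_absmax(absmax, batch)

# A per-tensor statistic collapses all channels into a single range.
per_tensor_absmax = max(absmax)
```

In this toy data a single outlier channel sets the per-tensor range, which is the situation where per-channel or per-group statistics give noticeably tighter scales.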

2. Per-Channel vs Per-Group Quantization

Each weight matrix has a 2D shape (out_features, in_features). The scale factor can be applied at different granularities:

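Granularity           | Scales per matrix | Accuracy          | Overhead
----------------------+-------------------+-------------------+---------
Per-tensor            | 1                 | Worst             | ~0%
Per-channel (per-row) | out_features      | Better            | ~0.5%
Per-group, g=128      | out × (in/128)    | Best              | ~3-5%
Per-group, g=64       | out × (in/64)     | Marginally better | ~6-10%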

The overhead is the metadata: extra bytes for scales and (for asymmetric quantization) zero points stored alongside the int4 weights. Both AWQ and GPTQ default to g=128. One fp16 scale per group of 128 int4 weights adds about 3%, and a per-group zero point adds a little more.
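To make the granularity tradeoff concrete, here is a small sketch in plain Python. It assumes symmetric round-to-nearest quantization and a synthetic weight row with one outlier. The function names and the toy data are hypothetical, and real toolkits pack codes and scales differently.

```python
import random


def quantize_dequantize(row, group_size, bits=4):
    """Symmetric round-to-nearest with one scale per group of `group_size` weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for int4
    out = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # avoid a zero scale
        for w in group:
            code = max(-qmax - 1, min(qmax, round(w / scale)))
            out.append(code * scale)
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def metadata_overhead(group_size, weight_bits=4, scale_bits=16, zero_bits=0):
    """Storage for scales (and zero points) relative to the packed weights."""
    return (scale_bits + zero_bits) / (group_size * weight_bits)


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(1024)]
row[0] = 1.0  # one outlier weight stretches every scale that has to cover it

err_per_row = mse(row, quantize_dequantize(row, group_size=len(row)))
err_g128 = mse(row, quantize_dequantize(row, group_size=128))
err_g64 = mse(row, quantize_dequantize(row, group_size=64))

overhead_g128 = metadata_overhead(128)                    # 16 / 512
overhead_g128_zero = metadata_overhead(128, zero_bits=4)  # 20 / 512
overhead_g64 = metadata_overhead(64)                      # 16 / 256
```

Smaller groups confine the outlier's damage to fewer weights, so the reconstruction error falls as the group shrinks, while the metadata cost doubles each time the group size halves.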

3. Layer-By-Layer Quantization (GPTQ Style)

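Naive round-to-nearest quantizes every weight independently. GPTQ works through the model one layer at a time and, within each layer, compensates for the error that each quantized column introduces:

import torch

def gptq_layer(W, X, bits=4, damp=0.01):
    # W: (out_features, in_features) weights; X: (n_tokens, in_features) inputs
    # captured for this layer during the calibration forward pass
    W, X = W.clone().float(), X.float()
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1).clamp(min=1e-8) / qmax  # per-row scale, for brevity

    # 1. Hessian of the layer's reconstruction loss: H = X.T @ X (input second moment)
    H = X.T @ X
    H.diagonal().add_(damp * H.diagonal().mean())  # dampening keeps H invertible

    # 2. Upper Cholesky factor of H^-1; row j weights the spread of column j's error
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    Hinv = torch.linalg.cholesky(Hinv, upper=True)

    # 3. Quantize column by column
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = torch.clamp(torch.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # Spread the error onto the columns that are not yet quantized
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Q

GPTQ is a faster approximation of OBQ (Optimal Brain Quantization): all rows are quantized in the same column order, so a single inverse Hessian serves the whole matrix, and the rows of that inverse needed for error spreading are precomputed with a Cholesky factorization. Accumulating H = X.T @ X is a plain matrix product, which parallelizes well on a GPU.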

4. Export Formats

Format                   | Engine             | Hardware                  | Bit widths
-------------------------+--------------------+---------------------------+-----------------
GGUF                     | llama.cpp / Ollama | CPU + Metal + CUDA + ROCm | 2, 3, 4, 5, 6, 8
AWQ (.awq.safetensors)   | vLLM, AutoAWQ, TGI | NVIDIA                    | 4 (group 128)
GPTQ (.gptq.safetensors) | ExLlama, vLLM, TGI | NVIDIA                    | 2, 3, 4, 8
MLX                      | MLX framework      | Apple Silicon (M1/M2/M3)  | 4, 6, 8
ONNX-INT4                | ONNX Runtime       | x86 CPU, ARM, NPUs        | 4, 8
TensorRT-LLM checkpoint  | TensorRT-LLM       | NVIDIA H100/H200/B100     | 4, 8, fp8

GGUF is the lingua franca for desktop/edge deployments — single-file, mmap-friendly, supports many bit widths within one file. AWQ and GPTQ are the GPU-server formats. MLX is Apple-only but increasingly popular for laptops. Most quantization toolkits support multiple exports from the same calibrated state.

5. Hardware Speedup Comparison

Hardware         | fp16 baseline | int4 throughput | fp8 throughput          | Notes
-----------------+---------------+-----------------+-------------------------+--------------------------------------------------------
RTX 4090 (24 GB) | ~30 tok/s     | ~110 tok/s      | n/a                     | Memory-bound; int4 unlocks 70B
A100 (80 GB)     | ~40 tok/s     | ~150 tok/s      | n/a                     | Ampere; int4 via TensorCore int8
H100 (80 GB)     | ~70 tok/s     | ~250 tok/s      | ~140 tok/s (per-tensor) | fp8 has half int4's memory savings but wider HW support
M3 Max (128 GB)  | ~10 tok/s     | ~30 tok/s       | n/a                     | Unified memory wins on big models
MI300X (192 GB)  | ~50 tok/s     | ~180 tok/s      | ~120 tok/s              | ROCm support uneven

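Throughput numbers are approximate single-stream decode rates (tokens per second) for Llama-3 70B at batch=1. Llama-3 70B needs roughly 140 GB for fp16 weights and about 35 GB for int4, so the fp16 figures on 24 GB and 80 GB devices (and the int4 figure on the 24 GB RTX 4090) cannot come from a single device holding the whole model; treat them as rough reference points. Bigger batches further amortize the weight load and tilt toward compute-bound, where fp16 closes the gap.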
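The memory-bound argument can be turned into back-of-the-envelope arithmetic. The sketch below assumes a round 70e9 parameter count, one fp16 scale per group of 128 weights, and decode time dominated by reading weights once per token. The function names are illustrative, and the result is an upper bound rather than a measurement.

```python
def bits_per_weight(weight_bits, group_size=None, scale_bits=16, zero_bits=0):
    """Effective storage per weight, including per-group metadata if any."""
    extra = (scale_bits + zero_bits) / group_size if group_size else 0.0
    return weight_bits + extra


def weight_gb(n_params, bpw):
    """Weight footprint in GB (10^9 bytes)."""
    return n_params * bpw / 8 / 1e9


def memory_bound_speedup(baseline_bpw, quantized_bpw):
    """Upper bound on batch=1 decode speedup when weight reads dominate."""
    return baseline_bpw / quantized_bpw


N_PARAMS = 70e9  # assumption: round 70B parameter count

fp16_bpw = bits_per_weight(16)
int4_bpw = bits_per_weight(4, group_size=128)  # 4 + 16/128 = 4.125
fp8_bpw = bits_per_weight(8)

fp16_gb = weight_gb(N_PARAMS, fp16_bpw)
int4_gb = weight_gb(N_PARAMS, int4_bpw)

int4_bound = memory_bound_speedup(fp16_bpw, int4_bpw)
fp8_bound = memory_bound_speedup(fp16_bpw, fp8_bpw)
```

Under these assumptions the int4 bound is a little under 4× and the fp8 bound is exactly 2×. Kernel efficiency, KV-cache traffic, and dequantization cost all pull real numbers below the bound.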

6. Validation: How to Tell If You Wrecked the Model

  • Perplexity on WikiText-2: classic, fast, but only measures language modeling. Aim for <1% increase vs fp16.
  • MMLU, ARC, HellaSwag: knowledge / reasoning benchmarks. Should be within 1-2 points.
  • Domain eval: if you'll deploy on code, run HumanEval. If chat, run MT-Bench.
  • Output-distribution comparison: KL divergence between fp16 and quantized model logits on a held-out set.
  • Long-context degradation: quantization often hurts more at >8k tokens. Test specifically.
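# Quick PPL regression check.
# Sketch: `perplexity` stands in for your own evaluation helper (for example a
# wrapper around lm-evaluation-harness's wikitext task); it is not a library call.
ppl_fp16 = perplexity(fp16_model, "wikitext")
ppl_int4 = perplexity(int4_model, "wikitext")
delta = (ppl_int4 - ppl_fp16) / ppl_fp16
assert delta < 0.01, f"Quantization regression: +{delta*100:.1f}%"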
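Two of the checks listed above, perplexity and the KL divergence between fp16 and quantized logits, reduce to a few lines once per-position logits are available. This is a minimal sketch in plain Python with toy three-token logits. The function names and values are illustrative; a real evaluation would batch this over a held-out set on the GPU.

```python
import math


def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def perplexity_from_logits(logits_per_position, target_ids):
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = -sum(log_softmax(l)[t] for l, t in zip(logits_per_position, target_ids))
    return math.exp(nll / len(target_ids))


def mean_kl(ref_logits, test_logits):
    """Mean KL(ref || test) over positions, in nats."""
    total = 0.0
    for ref, test in zip(ref_logits, test_logits):
        lp, lq = log_softmax(ref), log_softmax(test)
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(ref_logits)


# Toy logits for two positions over a 3-token vocabulary.
fp16_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
int4_logits = [[1.8, 0.6, -0.9], [0.2, 1.4, 0.3]]
targets = [0, 1]

ppl_fp16 = perplexity_from_logits(fp16_logits, targets)
ppl_int4 = perplexity_from_logits(int4_logits, targets)
delta = (ppl_int4 - ppl_fp16) / ppl_fp16
kl = mean_kl(fp16_logits, int4_logits)
```

KL divergence is the more sensitive of the two: it registers any shift in the output distribution, while perplexity only moves when probability mass leaves the reference token.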

Tradeoffs

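Decision              | Pro                      | Con
----------------------+--------------------------+------------------------------------------------------------
GPTQ vs AWQ           | GPTQ slightly better PPL | AWQ much faster, simpler kernel
g=64 vs g=128         | g=64 better quality      | 2× metadata overhead
int4 weight, fp16 act | Quality preserved        | Compute still bottlenecked by fp16
fp8 weight + act      | 2× compute on H100       | Needs fp8-capable hardware (e.g. Hopper); tighter accuracy margins
GGUF over native      | Universal (CPU + GPU)    | Slightly slower than tuned engines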

FAQ

Which method should I pick for production?

For NVIDIA GPU servers: AWQ int4 in vLLM or TensorRT-LLM fp8. For Mac laptops: MLX 4-bit. For CPU-only or mixed: GGUF Q4_K_M. AWQ is the safest default — fast to produce, fast at runtime, near-fp16 quality.

Why does Q4_K_M outperform Q4_0 in llama.cpp?

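K-quants group weights into super-blocks of 256 and store per-block scales (and minimums) that are themselves quantized, which tracks the weight distribution more closely than Q4_0's single fp16 scale per 32 weights. The "_S", "_M", and "_L" suffixes (small, medium, large) select how many of the more sensitive tensors are kept at a higher-bit type: Q4_K_M stores most tensors as Q4_K and promotes some attention and feed-forward tensors to Q6_K. The average bit width rises only slightly, but quality is meaningfully higher.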

Can I quantize and finetune simultaneously?

Yes, that's QLoRA (Dettmers et al., 2023). Base weights stay frozen in 4-bit NF4; LoRA adapters in 16-bit precision (bf16 in the paper) are trained on top. The base weights are dequantized on the fly during the forward pass, and gradients flow only into the adapters.
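The forward pass described here can be sketched in a few lines of plain Python. This is a simplification under stated assumptions: real NF4 uses a non-uniform 16-level codebook with block-wise scales, whereas the sketch uses uniform integer codes and a single scale. All names and values are hypothetical.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def qlora_forward(codes, scale, A, B, x, lora_scale=1.0):
    """y = dequant(codes) @ x + lora_scale * B @ (A @ x); only A and B would be trained."""
    W = [[c * scale for c in row] for row in codes]  # dequantize the frozen base on the fly
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))                 # low-rank path through rank r
    return [b + lora_scale * u for b, u in zip(base, update)]


codes = [[3, -2, 0], [1, 7, -8]]  # frozen 4-bit integer codes, shape (out=2, in=3)
scale = 0.5
A = [[0.5, -0.25, 1.0]]           # shape (r=1, in=3)
B_init = [[0.0], [0.0]]           # shape (out=2, r=1), zero-initialized
B_trained = [[1.0], [-2.0]]       # stand-in for B after some training steps
x = [1.0, 2.0, 4.0]

y_init = qlora_forward(codes, scale, A, B_init, x)
y_trained = qlora_forward(codes, scale, A, B_trained, x)
```

With B initialized to zero the adapter contributes nothing, so training starts from exactly the quantized base model's outputs.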

How do I know when to retrain (QAT) instead of PTQ?

If PTQ at your target bit width loses >2% on benchmarks, you need QAT. For ≥4 bits, PTQ is almost always sufficient. For ≤2 bits, QAT is mandatory.

Does the quantization toolkit need a GPU?

For 7B–13B, you can quantize on CPU but it takes hours. For 70B+, you need at least one GPU with enough VRAM to hold one layer at a time (~5 GB). Most pipelines stream layers off-GPU as soon as they're done to keep VRAM low.

How does this differ from training-time quantization?

PTQ is purely post-hoc; you start with a finished fp16 model. Quantization-aware training (QAT) bakes the quantization noise into the training loop and is used for extreme bit widths like BitNet's 1.58 bits. PTQ is roughly 100× faster, but its quality degrades sharply below about 3 bits.