🧠 LLM Internals

Large language models are reshaping computing, but their internals remain opaque to most engineers. These visualizations let you step through transformer attention, watch BPE tokenization unfold, and see why KV caches matter for inference performance.

✦ Live

Transformer Architecture

Interactive attention heatmaps, layer-by-layer forward pass, Q/K/V matrices, and feed-forward networks

✦ Live

Tokenization & BPE

Live tokenizer, BPE merge step-through, vocabulary explorer, and encoding comparisons

✦ Live

KV Cache & Inference

Autoregressive generation, KV cache memory, PagedAttention, MQA/GQA, and continuous batching

✦ Live

GPU Memory Management

CUDA memory allocator, memory pooling, gradient checkpointing, mixed precision, and OOM strategies

✦ Live

Inference Serving

Continuous batching, PagedAttention (vLLM), speculative decoding, KV cache optimization, and time to first token (TTFT) vs tokens per second (TPS)

✦ Live

Model Quantization

INT8 and INT4 precision, GPTQ and AWQ methods, calibration, and accuracy-speed tradeoffs for efficient deployment

✦ Live

TurboQuant

Google Research's vector quantization for 6× KV cache compression — near-optimal distortion at 3-bit precision

✦ Live

Autoresearch

Karpathy's autonomous AI research loop — agents experiment on LLM training code overnight, ~100 experiments while you sleep

🔗 Related Topics