🧠 LLM Internals
Large Language Models are reshaping computing, but their internals remain opaque to most engineers. These visualizations let you step through transformer attention, watch BPE tokenization unfold, and understand why KV caches matter for inference performance.
Transformer Architecture
Interactive attention heatmaps, layer-by-layer forward pass, Q/K/V matrices, and feed-forward networks
Tokenization & BPE
Live tokenizer, BPE merge step-through, vocabulary explorer, and encoding comparisons
KV Cache & Inference
Autoregressive generation, KV cache memory, PagedAttention, MQA/GQA, and continuous batching
GPU Memory Management
CUDA memory allocator, memory pooling, gradient checkpointing, mixed precision, and OOM strategies
Inference Serving
Continuous batching, PagedAttention (vLLM), speculative decoding, KV cache optimization, and time to first token (TTFT) vs. tokens per second (TPS)
Model Quantization
INT8 and INT4 precision, GPTQ, AWQ, calibration, and accuracy-speed tradeoffs for efficient deployment
TurboQuant
Google Research's vector quantization for 6× KV cache compression, with near-optimal distortion at 3-bit precision
Autoresearch
Karpathy's autonomous AI research loop: agents experiment on LLM training code overnight, running ~100 experiments while you sleep