Memory Management
Virtual Memory, Paging, the TLB, and the OOM Killer
On x86_64 with 4-level paging, virtual addresses are 48 bits wide. That gives a 256 TiB address space, of which the lower 128 TiB belongs to the process and the upper half to the kernel. None of it is real until the process touches it, at which point the CPU faults, the kernel finds (or allocates) a physical page, and installs it in the 4-level page table. Almost every memory-management feature on the system (copy-on-write fork, file-backed mmap, swap, page cache, hugepages, KSM, NUMA) falls out of this same paging mechanism.
The work that doesn't fall out of paging falls out of policy: which page to reclaim under pressure, which process to kill when the system runs out, how to balance pages between cgroups. Linux's answers are heuristic, tunable, and occasionally surprising. Knowing the policies is the difference between "the OOM killer picked Postgres again, why?" and a deliberately configured system.
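The "not real until touched" behavior can be observed from user space. The following is a minimal sketch, not a benchmark: it creates an anonymous mapping and compares the process's minor-fault counter before and after writing one byte to each page. The function names are illustrative, and the exact counts depend on the kernel, the THP setting, and the Python runtime, so treat the numbers as indicative only.

```python
import mmap
import resource


def minor_faults():
    """Minor page faults taken by this process so far (ru_minflt)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def faults_when_touching(n_pages):
    """Return (faults caused by mapping, faults caused by touching).

    Mapping only reserves virtual address space; the first write to each
    page is what makes the kernel allocate and map physical memory.
    """
    page = mmap.PAGESIZE
    before_map = minor_faults()
    buf = mmap.mmap(-1, n_pages * page)  # anonymous mapping, nothing touched yet
    after_map = minor_faults()
    for i in range(n_pages):
        buf[i * page] = 1  # first write to the page triggers a fault
    after_touch = minor_faults()
    buf.close()
    return after_map - before_map, after_touch - after_map
```

On a typical Linux system with 4 KiB pages, calling `faults_when_touching(1024)` should report few faults for the mapping step and roughly one per page for the touching step. With THP set to `always`, the second number can be much smaller.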
Virtual Address Space (x86_64)
Key Numbers
4-Level Paging on x86_64
/* A 64-bit virtual address is decoded as: */
| 16 bits sign-extension | 9 PGD | 9 PUD | 9 PMD | 9 PTE | 12 page offset |
                         ^----------- 4 levels ----------^ ^-- 4 KiB page --^
                           512 entries each, 8 bytes per entry
                           (so each table fills exactly one 4 KiB page)
/* The CPU's CR3 register holds the physical address of the PGD (page global directory).
   At each level, 9 bits of the address index the current table, and the selected entry
   points to the next table, until the walk lands on a 4 KiB page. */
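The bit layout above can be checked with a few shifts and masks. This is a small sketch that assumes 4-level paging with 4 KiB pages, as in the diagram. `split_vaddr` and `is_canonical` are illustrative names, not kernel functions.

```python
def split_vaddr(vaddr):
    """Split an x86_64 virtual address into its four 9-bit table indices
    and the 12-bit offset within the 4 KiB page."""
    return {
        "pgd": (vaddr >> 39) & 0x1FF,
        "pud": (vaddr >> 30) & 0x1FF,
        "pmd": (vaddr >> 21) & 0x1FF,
        "pte": (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }


def is_canonical(vaddr):
    """With 48-bit addressing, bits 63..48 must all be copies of bit 47."""
    top = vaddr >> 47  # bit 47 plus the 16 sign-extension bits
    return top == 0 or top == 0x1FFFF
```

For example, `split_vaddr((1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5)` yields indices 1, 2, 3, 4 and offset 5, which makes it easy to see which slice of the address each level consumes.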
/* Walking by hand on a live system: */
$ sudo cat /proc/<pid>/pagemap | head -c <...>
# pagemap returns one 64-bit entry per virtual page describing PFN, swap, soft-dirty
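Reading pagemap with cat produces raw binary, so in practice a small decoder helps. The sketch below follows the bit layout given in the kernel's pagemap documentation (bit 63 present, bit 62 swapped, bit 61 file-backed or shared, bit 55 soft-dirty, low 55 bits PFN or swap location). The helper names are illustrative. Note that the PFN reads as zero for callers without CAP_SYS_ADMIN on current kernels.

```python
import struct

PAGEMAP_ENTRY_BYTES = 8


def decode_pagemap_entry(entry):
    """Decode one 64-bit /proc/<pid>/pagemap entry into a dict."""
    present = bool((entry >> 63) & 1)
    swapped = bool((entry >> 62) & 1)
    info = {
        "present": present,
        "swapped": swapped,
        "file_or_shared": bool((entry >> 61) & 1),
        "soft_dirty": bool((entry >> 55) & 1),
    }
    low55 = entry & ((1 << 55) - 1)
    if present:
        info["pfn"] = low55  # zero when read without CAP_SYS_ADMIN
    elif swapped:
        info["swap_type"] = low55 & 0x1F   # bits 0-4
        info["swap_offset"] = low55 >> 5   # bits 5-54
    return info


def read_pagemap_entry(pid, vaddr, page_size=4096):
    """Read and decode the entry covering vaddr; pid may be 'self'."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * PAGEMAP_ENTRY_BYTES)
        raw = f.read(PAGEMAP_ENTRY_BYTES)
    if len(raw) != PAGEMAP_ENTRY_BYTES:
        raise ValueError("address is outside the pagemap range")
    return decode_pagemap_entry(struct.unpack("=Q", raw)[0])
```

The file is indexed by virtual page number, which is why the seek offset is the page number times eight.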
/* Kernel structures (simplified; field names and types vary across kernel versions) */
struct mm_struct {
pgd_t *pgd; /* root of the page table tree */
unsigned long total_vm; /* total pages mapped */
unsigned long rss_stat[]; /* file/anon/shmem counts */
struct vm_area_struct *mmap; /* list of VMAs (replaced by a maple tree, mm_mt, in Linux 6.1) */
};
struct vm_area_struct {
unsigned long vm_start, vm_end; /* address range */
pgprot_t vm_page_prot; /* permissions */
unsigned long vm_flags; /* VM_READ, VM_WRITE, VM_EXEC, VM_SHARED */
struct file *vm_file; /* file-backed mapping; NULL for anon */
};
The Page Fault Path
// CPU traps into the kernel on access to an unmapped or protected page
do_page_fault(regs, error_code)   // named exc_page_fault() on newer kernels
  → __do_page_fault()
    → handle_mm_fault(vma, address, flags)
      ├─ if VMA missing → SIGSEGV
      ├─ if perms wrong → SIGSEGV
      └─ pte resolution:
         ├─ pte not present, anon → zero a page, map it (MINOR)
         ├─ pte not present, file → page cache lookup; hit is MINOR, miss reads from disk (MAJOR)
         ├─ pte swapped out → swap in (MAJOR)
         ├─ pte present, write to a COW page → allocate, copy, remap (MINOR)
         └─ page allocation fails → VM_FAULT_OOM → OOM killer
// Watch faults live
$ vmstat 1
procs ----memory---- ---swap-- -----io----
r b swpd free si so bi bo
2 0 0 25.4G 0 0 1234 456
# Per-process fault counters: fields 10 (minflt) and 12 (majflt) of /proc/<pid>/stat
$ ps -o pid,min_flt,maj_flt,comm -p <pid>
The OOM Killer
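# Score is roughly RSS + swap + page tables as a share of total memory, shifted by oom_score_adj
$ for p in /proc/[0-9]*; do
    pid=${p#/proc/}
    [ -r "$p/oom_score" ] || continue
    # a process may exit mid-loop, so silence read errors
    printf "%-8s %-6s %s\n" "$pid" "$(cat "$p/oom_score" 2>/dev/null)" "$(cat "$p/comm" 2>/dev/null)"
  done | sort -k2 -nr | head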
6789 789 chrome
2341 412 postgres
1023 198 systemd
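# Protect a process from OOM (lowering the value needs root or CAP_SYS_RESOURCE)
# /proc/self here is the current shell; child processes inherit the value
$ echo -1000 > /proc/self/oom_score_adj # immune
$ echo 1000 > /proc/self/oom_score_adj # take me first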
# Watch OOM kills
$ dmesg -T | grep -i "killed process"
[Sat May 3 14:22:01 2026] Out of memory: Killed process 6789 (chrome)
total-vm:8923456kB, anon-rss:5421000kB, file-rss:0kB
Transparent Huge Pages (THP)
# THP modes: never / madvise / always
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
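# 'always' is where the latency tax comes from: a page fault may compact memory on the spot
# to assemble a 2M page (tunable via the sibling 'defrag' file), and khugepaged scans VMAs
# in the background to collapse runs of 4K pages into 2M ones.
# Compaction can stall the process for tens to hundreds of ms.
# Most database vendors recommend 'never' or 'madvise'.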
# Disable for the boot
GRUB_CMDLINE_LINUX="transparent_hugepage=never"
# Or runtime
$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Use explicit hugepages instead — preallocate, no defrag
$ echo 1024 > /proc/sys/vm/nr_hugepages # 1024 * 2 MiB = 2 GiB
$ mount -t hugetlbfs none /mnt/huge
# Postgres: huge_pages = on; uses MAP_HUGETLB
cgroup v2 Memory Controller
# The controller files in a cgroup directory
/sys/fs/cgroup/myapp.slice/
├── memory.current # current RSS+cache+kernel
├── memory.high # soft throttle (slow when crossed)
├── memory.max # hard limit (reclaim first; cgroup OOM kill if usage can't be reduced)
├── memory.low # protection (reclaim avoids this much)
├── memory.min # absolute protection (kernel never reclaims)
├── memory.swap.current
├── memory.swap.max
├── memory.events # counters: low, high, max, oom, oom_kill
├── memory.pressure # PSI: stall time as fraction of wallclock
└── memory.stat # detailed breakdown by type
# Set limits via systemd
[Service]
MemoryHigh=8G
MemoryMax=10G
MemorySwapMax=2G
# Watch real-time pressure
$ cat /sys/fs/cgroup/myapp.slice/memory.pressure
some avg10=12.34 avg60=4.12 avg300=1.23 total=12345678
full avg10=2.10 avg60=0.95 avg300=0.30 total=2345678
# ^^^^^^^^^^^^^^^^^^^^^^^^^ percent of time the cgroup stalled on memory
Tradeoffs
- Demand paging means processes consume memory only as they touch it
- Page cache transparently buffers file I/O (everything except O_DIRECT)
- cgroup v2 provides predictable per-tenant isolation in containers
- PSI metrics give you 'stall time' instead of guessing from raw counters
- OOM killer scoring can pick the wrong victim under skewed workloads
- THP causes latency spikes; explicit hugepages need preallocation
- NUMA-unaware allocations cause cross-socket traffic and higher memory access latency
- Memory leaks in long-running services manifest as gradual swap pressure, not crashes
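The PSI point above is easier to act on once memory.pressure is turned into numbers. The following is a minimal parsing sketch for the two-line format shown in the memory.pressure sample earlier. The function names and the 10% threshold are illustrative assumptions, not recommended values.

```python
def parse_psi(text):
    """Parse PSI text ('some ...' / 'full ...' lines) into nested dicts."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, value = field.split("=")
            # 'total' is cumulative stall time in microseconds; the avg* fields are percentages
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def is_under_pressure(psi, threshold_pct=10.0):
    """Hypothetical policy: flag the cgroup when some-avg10 crosses a threshold."""
    return psi["some"]["avg10"] >= threshold_pct
```

To use it against a live cgroup, read `/sys/fs/cgroup/<group>/memory.pressure` and pass its contents to `parse_psi`. The "some" line counts time when at least one task was stalled on memory, while "full" counts time when all non-idle tasks were stalled at once.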
Frequently Asked Questions
What's the difference between a minor and a major page fault?
A minor page fault is one the kernel can resolve without I/O: the page is already in physical RAM (for example, another process has it mapped via the page cache), or it is a fresh anonymous page that only needs zeroing. The kernel updates the page tables and resumes the process. Cost: a few microseconds at most. A major page fault means the page must be read from storage (it was swapped out, or it belongs to a memory-mapped file and is not yet in the page cache). Cost: hundreds of microseconds to tens of milliseconds depending on storage. To see both per process, read fields 10 (minflt) and 12 (majflt) of /proc/<pid>/stat, or run 'ps -o min_flt,maj_flt'.
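For scripts that want these counters directly, /proc/<pid>/stat can be parsed by hand. The sketch below uses the field positions documented in proc(5), where minflt is field 10 and majflt is field 12. The helper names are illustrative. The one subtlety is that the command name in field 2 may itself contain spaces or parentheses, so the split starts after the last closing parenthesis.

```python
def parse_fault_counters(stat_line):
    """Extract minflt (field 10) and majflt (field 12) from a /proc/<pid>/stat line."""
    # Everything after the last ')' starts at field 3 (state).
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3, so field N lives at rest[N - 3].
    return {"minflt": int(rest[10 - 3]), "majflt": int(rest[12 - 3])}


def fault_counters(pid="self"):
    """Read the counters for a live process; pid may be 'self'."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_fault_counters(f.read())
```

Sampling `fault_counters(pid)` twice and subtracting gives a fault rate. A steadily growing majflt on a service that should fit in RAM usually points at swap or page cache eviction.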
How does the OOM killer choose its victim?
Each process has an oom_score computed roughly from its RSS plus swap usage plus page-table memory, taken as a share of total memory and then shifted by oom_score_adj (a value from -1000 to 1000). When an allocation cannot be satisfied even after reclaim, the kernel scans the candidate tasks and sends SIGKILL to the highest scorer. You can protect critical processes by writing -1000 to /proc/<pid>/oom_score_adj, which exempts them from OOM killing entirely. systemd lowers the value for some of its core services (journald, for example) through the OOMScoreAdjust= unit directive, though not necessarily all the way to -1000. Caveat: you can't protect everything. If every task is set to -1000, the kernel has no eligible victim, and a system-wide OOM then ends in a kernel panic rather than a kill.
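The scoring can be modeled in a few lines. This is a simplified sketch of the heuristic as I understand it from oom_badness() in mm/oom_kill.c on recent kernels. The real function has additional exclusions and the details change between versions, so treat it as an assumption-laden model. All names and the task dictionary layout are illustrative.

```python
def oom_badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Return a badness score in pages, or None if the task is exempt."""
    if oom_score_adj == -1000:
        return None  # never selected by the OOM killer
    points = rss_pages + swap_pages + pgtable_pages
    # Each unit of oom_score_adj shifts the score by 0.1% of total memory.
    points += oom_score_adj * (total_pages // 1000)
    return points


def pick_victim(tasks, total_pages):
    """tasks: list of dicts with name, rss, swap, pgtables, adj (all hypothetical)."""
    best_name, best_score = None, None
    for t in tasks:
        score = oom_badness(t["rss"], t["swap"], t["pgtables"], t["adj"], total_pages)
        if score is None:
            continue
        if best_score is None or score > best_score:
            best_name, best_score = t["name"], score
    return best_name  # None means no eligible task
```

The model makes the "why Postgres again?" question concrete: a large resident set wins unless oom_score_adj on some other task adds more than the difference.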
Why do transparent huge pages sometimes hurt performance?
THP gives you 2 MiB pages instead of 4 KiB, which reduces TLB pressure for workloads with huge contiguous mappings (databases, JVMs). The kernel gets there either by allocating a 2 MiB page directly at fault time or by having khugepaged collapse runs of 4 KiB pages in the background. Two costs: (1) latency spikes when the kernel pauses your process to compact memory and form a 2M page; (2) write amplification when COW or NUMA migration has to copy a whole 2M page or split it back into 4K pages. For latency-sensitive databases (Postgres, Redis) the common recommendation is to disable THP or restrict it to madvise, and to use explicit hugepages where the database supports them.
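Before tuning anything it helps to confirm which THP mode is actually active. The sysfs file marks the current choice with square brackets, as in the `[always] madvise never` output shown earlier. A minimal sketch of reading it follows; the function names are illustrative.

```python
import re

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text):
    """Return the bracketed choice from text like '[always] madvise never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError("no bracketed mode found in: %r" % text)
    return match.group(1)


def read_thp_mode(path=THP_ENABLED_PATH):
    """Read the active mode from sysfs (the file is absent if THP is not built in)."""
    with open(path) as f:
        return active_thp_mode(f.read())
```

A deployment check could call `read_thp_mode()` at startup and log a warning when a latency-sensitive service finds itself running under `always`.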
What is the TLB and why is it small?
The Translation Lookaside Buffer is a hardware cache of recent virtual-to-physical translations. It lives in the CPU's MMU and is consulted on every memory access. It's small (Intel x86: typically 64 entries L1 + ~1500 L2 per core) because it must be very fast: a lookup sits on the critical path of every load and store. With 4 KiB pages, 64 TLB entries cover only 256 KiB of memory; for larger working sets you get TLB misses and the CPU must walk the 4-level page table from RAM (or its caches). The penalty is on the order of tens of cycles (about 60 is a common figure) and grows when the page-table entries themselves miss in cache. Huge pages (2 MiB) extend coverage to 128 MiB with the same 64 entries.
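The coverage figures are plain multiplication, which the following sketch makes explicit. The entry counts are the ones quoted above; real CPUs often have separate, differently sized TLBs for 4 KiB and 2 MiB pages, so the "same 64 entries" case is an idealization.

```python
import math

KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_size):
    """Bytes of memory addressable without a TLB miss."""
    return entries * page_size


def entries_needed(working_set, page_size):
    """TLB entries required to cover a working set with no misses."""
    return math.ceil(working_set / page_size)
```

For a 1 GiB working set, `entries_needed` gives 262144 entries with 4 KiB pages but only 512 with 2 MiB pages, which is why huge pages matter for databases and JVM heaps.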
How does cgroup v2 memory.high differ from memory.max?
memory.max is a hard limit: when an allocation would push usage past it, the kernel first tries to reclaim from the cgroup and, if usage still can't be brought under the limit, invokes the OOM killer inside the cgroup. (Disabling the OOM killer so that tasks block instead is a cgroup v1 feature, memory.oom_control; cgroup v2 has no such switch.) memory.high is a soft throttle: when usage exceeds it, the kernel throttles the cgroup's tasks and puts them under heavy reclaim pressure, slowing them down without killing them. Setting memory.high gives you predictable degradation under pressure (slow, not dead), while memory.max gives you the safety guarantee. Common pattern: set memory.high to 90% of capacity, memory.max to 100%.
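Which of the two limits a cgroup is actually hitting shows up in memory.events, whose counters (low, high, max, oom, oom_kill) were listed with the controller files. The sketch below parses that file and maps the counters to a rough verdict. The verdict strings and the function names are illustrative, not kernel terminology.

```python
def parse_memory_events(text):
    """Parse 'key value' lines from memory.events into a dict of ints."""
    events = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        events[key] = int(value)
    return events


def diagnose(events):
    """Rough interpretation, most severe condition first."""
    if events.get("oom_kill", 0) > 0:
        return "tasks were OOM-killed"
    if events.get("max", 0) > 0:
        return "hit memory.max (hard limit)"
    if events.get("high", 0) > 0:
        return "throttled at memory.high"
    return "no limit events"
```

Because the counters are cumulative, a monitoring loop would compare successive readings rather than the absolute values. A cgroup that shows rising `high` but flat `max` and `oom_kill` is degrading the way the memory.high setting intends.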
What does 'swap accounting' do?
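Swap accounting makes the memory controller track swapped-out pages per cgroup. The memory limit itself covers RSS, page cache, and kernel memory, but not pages that have been pushed out to swap. With swap accounting, those pages are charged too: in cgroup v2 to a separate counter capped by memory.swap.max, in v1 to the combined memory.memsw.limit_in_bytes. Without it, a task can use all its allowed RAM, get swapped out, and then allocate more, effectively bypassing the limit. Older kernels gated the feature behind CONFIG_MEMCG_SWAP=y and the swapaccount=1 boot parameter; modern distros enable it by default. Caveat: tracking which cgroup owns each swapped page costs a small amount of extra memory.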