Memory Management

Virtual Memory, Paging, the TLB, and the OOM Killer

On x86_64 with 4-level paging, the virtual address space spans 256 TiB: 128 TiB of user space per process and 128 TiB for the kernel. None of it is real until the process touches it. At that point the CPU faults, and the kernel finds (or allocates) a physical page and stitches it into the process's 4-level page table. Almost every memory-management feature on the system (copy-on-write fork, file-backed mmap, swap, page cache, hugepages, KSM, NUMA) falls out of this same paging mechanism.

The work that doesn't fall out of paging falls out of policy: which page to reclaim under pressure, which process to kill when the system runs out, how to balance pages between cgroups. Linux's answers are heuristic, tunable, and occasionally surprising. Knowing the policies is the difference between "the OOM killer picked Postgres again, why?" and a deliberately configured system.

Virtual Address Space (x86_64)

Layout, low to high addresses:
  Text (code, RX)             0x00400000
  Data + BSS (RW)
  Heap                        grows up (brk/sbrk)
  mmap region                 shared libs, anonymous mmap
  Stack                       grows down from 0x7fff…
  Kernel space (high half)    0xffff800000000000+

Address space: user 128 TiB (47-bit), kernel 128 TiB, total 256 TiB
Page table: 4 levels (PGD→PUD→PMD→PTE), 4 KiB pages
5-level paging (Ice Lake+): 57-bit addresses, 64 PiB user space (128 PiB total)

Key Numbers

4 KiB            default page size on x86_64
2 MiB            huge page (PMD-level mapping)
1 GiB            gigantic page (PUD-level)
128 TiB          user virtual address space (4-level)
~64 / 1500       L1 / L2 TLB entries (typical Intel)
~60 cycles       page-table walk on TLB miss
-1000 to 1000    oom_score_adj range

4-Level Paging on x86_64

/* A 64-bit virtual address is decoded as: */
| 16 bits sign-extension | 9 PGD | 9 PUD | 9 PMD | 9 PTE | 12 page offset |
                           ^               4 levels                ^
                           512 entries each, each entry 8 bytes    4 KiB page

/* The CPU's CR3 register holds the physical address of the PGD (page global dir).
   At each level, the next 9 bits of the address index into the current table to
   find the next one, until the PTE yields a 4 KiB page. */
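
The bit layout above can be checked with a few shifts and masks. This is a minimal sketch, assuming 4-level paging with 4 KiB pages; `split_vaddr` is a hypothetical helper name, not a kernel function.

```python
PAGE_SHIFT = 12        # 4 KiB pages: the low 12 bits are the page offset
BITS_PER_LEVEL = 9     # 512 entries per table
LEVELS = ("pgd", "pud", "pmd", "pte")   # top level first


def split_vaddr(vaddr: int) -> dict:
    """Split a 48-bit virtual address into table indices and page offset."""
    fields = {"offset": vaddr & ((1 << PAGE_SHIFT) - 1)}
    # The PTE index sits just above the offset; the PGD index is the top 9 bits.
    for depth, name in enumerate(reversed(LEVELS)):
        shift = PAGE_SHIFT + depth * BITS_PER_LEVEL
        fields[name] = (vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1)
    return fields


if __name__ == "__main__":
    # 0x00400000 is 4 MiB, i.e. the third 2 MiB slot covered by one PMD table.
    print(split_vaddr(0x00400000))
```

Because each PMD entry spans 2 MiB and each PUD entry spans 1 GiB, the same indices also show where huge and gigantic pages terminate the walk early.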

/* Walking by hand on a live system: */
$ sudo cat /proc/<pid>/pagemap | head -c <...>
# pagemap returns one 64-bit entry per virtual page describing PFN, swap, soft-dirty
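
A pagemap entry can be unpacked with plain bit operations. The sketch below assumes the layout described in the kernel's pagemap documentation (bit 63 present, bit 62 swapped, bit 61 file-backed or shared-anonymous, bit 55 soft-dirty, bits 0-54 PFN); the function names are hypothetical. Without CAP_SYS_ADMIN the kernel reports the PFN as zero.

```python
import struct

PFN_MASK = (1 << 55) - 1   # bits 0-54 hold the page frame number


def decode_pagemap_entry(entry: int) -> dict:
    """Decode one 64-bit /proc/<pid>/pagemap entry into its flag bits."""
    present = bool((entry >> 63) & 1)
    return {
        "present": present,
        "swapped": bool((entry >> 62) & 1),
        "file_or_shared": bool((entry >> 61) & 1),
        "soft_dirty": bool((entry >> 55) & 1),
        # The low bits are only a PFN when the page is present.
        "pfn": (entry & PFN_MASK) if present else None,
    }


def read_pagemap_entry(pid, vaddr: int, page_size: int = 4096) -> dict:
    """Read the entry covering vaddr; there is one 8-byte entry per virtual page."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * 8)
        (entry,) = struct.unpack("=Q", f.read(8))
    return decode_pagemap_entry(entry)
```

`read_pagemap_entry("self", addr)` would inspect the calling process; the decode step is separate so it can be exercised without touching /proc.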

# Kernel structures (simplified; exact fields vary by kernel version)
struct mm_struct {
    pgd_t              *pgd;          /* root of the page table tree */
    unsigned long      total_vm;       /* total pages mapped */
    unsigned long      rss_stat[];     /* file/anon/shmem counts */
    struct vm_area_struct *mmap;       /* list of VMAs (replaced by a maple tree, mm_mt, in 6.1) */
};

struct vm_area_struct {
    unsigned long  vm_start, vm_end;   /* address range */
    pgprot_t       vm_page_prot;       /* permissions */
    unsigned long  vm_flags;           /* VM_READ, VM_WRITE, VM_EXEC, VM_SHARED */
    struct file    *vm_file;           /* file-backed mapping; NULL for anon */
};
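
From user space, the VMAs of a process are visible as lines in /proc/<pid>/maps, one per vm_area_struct. The sketch below parses such a line into the fields shown above (address range, permissions, backing file). The `Vma` tuple and `parse_maps_line` are hypothetical names, and the sample line in the usage note is made up for illustration.

```python
from typing import NamedTuple, Optional


class Vma(NamedTuple):
    start: int             # vm_start
    end: int               # vm_end (exclusive)
    perms: str             # e.g. "r-xp": read, no write, exec, private
    path: Optional[str]    # backing file or [heap]/[stack]; None if the field is empty


def parse_maps_line(line: str) -> Vma:
    """Parse 'start-end perms offset dev inode [path]' from /proc/<pid>/maps."""
    parts = line.split(maxsplit=5)
    start, end = (int(x, 16) for x in parts[0].split("-"))
    # The path column is optional and may itself contain spaces.
    path = parts[5].strip() if len(parts) > 5 else None
    return Vma(start, end, parts[1], path)
```

For example, `parse_maps_line("00400000-0040b000 r-xp 00000000 08:01 1234 /usr/bin/cat")` yields a read/execute, file-backed mapping starting at 0x400000.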

The Page Fault Path

// CPU traps into the kernel on access to an unmapped or protected page
do_page_fault(regs, error_code)        // exc_page_fault() on newer kernels
  → __do_page_fault()
    ├─ if VMA missing → SIGSEGV
    ├─ if perms wrong → SIGSEGV
    └─ handle_mm_fault(vma, address, flags) → pte resolution:
       ├─ pte_present + COW     → write to a page shared after fork; allocate, copy, remap (MINOR)
       ├─ pte not present, anon → zero a page, map it (MINOR)
       ├─ pte not present, file → page cache lookup; if miss, read from disk (MAJOR)
       ├─ pte swapped out       → swap in (MAJOR)
       └─ VM_FAULT_OOM          → page allocator fails → OOM killer
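
The decision tree above can be restated as a small function. This is a toy model for reasoning about outcomes, not kernel code: `classify_fault` and its parameters are hypothetical, and it assumes every swap-in needs I/O (the kernel can sometimes satisfy one from the swap cache).

```python
def classify_fault(vma_found: bool, access_allowed: bool, pte_present: bool,
                   swapped: bool = False, file_backed: bool = False,
                   in_page_cache: bool = False) -> str:
    """Return the outcome of a page fault under the simplified model."""
    if not vma_found or not access_allowed:
        return "SIGSEGV"
    if pte_present:
        # The VMA allows the access but the PTE does not: a write to a
        # copy-on-write page. Served from RAM, so it is a minor fault.
        return "minor (copy-on-write)"
    if swapped:
        return "major (swap-in)"
    if file_backed:
        return "minor (page cache hit)" if in_page_cache else "major (read from disk)"
    return "minor (zero-fill anonymous)"
```

The split between "minor" and "major" here is simply whether the path had to wait for storage.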

// Watch paging activity live (si/so = swap-in/swap-out rate, bi/bo = block I/O)
$ vmstat 1
 procs ----memory---- ---swap-- -----io----
  r  b   swpd   free   si   so    bi    bo
  2  0      0 25.4G    0    0   1234   456

# Per-process fault counters (minflt and majflt are fields 10 and 12 of /proc/<pid>/stat)
$ ps -o min_flt,maj_flt -p <pid>
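
The same counters can be read programmatically. The sketch below assumes the /proc/<pid>/stat layout from proc(5), where minflt is field 10 and majflt is field 12; the helper names are hypothetical. The one subtlety is that the command name in field 2 is parenthesised and may contain spaces, so the line is split after the last closing parenthesis.

```python
def parse_fault_counters(stat_line: str) -> dict:
    """Extract minor/major fault counts from one /proc/<pid>/stat line."""
    # Everything after the last ')' starts at field 3 (process state).
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3, so field 10 is rest[7] and field 12 is rest[9].
    return {"minflt": int(rest[7]), "majflt": int(rest[9])}


def fault_counters(pid="self") -> dict:
    """Read the counters for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_fault_counters(f.read())


if __name__ == "__main__":
    print(fault_counters())
```

Sampling `fault_counters(pid)` twice and subtracting gives a fault rate for that process over the interval.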

The OOM Killer

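# Score is roughly RSS + swap + page tables, relative to total memory, shifted by oom_score_adj
$ for p in /proc/[0-9]*; do
    pid=${p#/proc/}
    [ -r "$p/oom_score" ] || continue
    # Processes can exit mid-loop, so ignore read errors
    printf "%-8s %-6s %s\n" "$pid" "$(cat "$p/oom_score" 2>/dev/null)" "$(cat "$p/comm" 2>/dev/null)"
  done | sort -k2 -nr | head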
6789    789    chrome
2341    412    postgres
1023    198    systemd

# Protect a process from OOM (/proc/self is the shell here; lowering the value needs CAP_SYS_RESOURCE)
$ echo -1000 > /proc/self/oom_score_adj   # immune
$ echo 1000  > /proc/self/oom_score_adj   # take me first

# Watch OOM kills
$ dmesg -T | grep -i "killed process"
[Sat May 3 14:22:01 2026] Out of memory: Killed process 6789 (chrome)
total-vm:8923456kB, anon-rss:5421000kB, file-rss:0kB
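
To see how oom_score_adj shifts the ranking, here is a simplified model of victim selection. It assumes the badness heuristic sums resident pages, swap entries, and page-table pages, then adds oom_score_adj scaled to one thousandth of total memory per point; the real kernel logic has more special cases. All process names and numbers in the usage note are made up.

```python
def oom_badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Simplified badness in pages; None means the task is never selected."""
    if oom_score_adj == -1000:
        return None
    points = rss_pages + swap_pages + pgtable_pages
    # Each adj point is worth 1/1000 of total memory.
    points += oom_score_adj * (total_pages // 1000)
    return points


def pick_victim(tasks: dict, total_pages: int):
    """tasks maps name -> (rss, swap, pgtables, oom_score_adj), in pages."""
    scored = {name: oom_badness(*t, total_pages) for name, t in tasks.items()}
    eligible = {name: s for name, s in scored.items() if s is not None}
    return max(eligible, key=eligible.get) if eligible else None
```

With `total_pages = 4_000_000`, a task holding 1,500,000 resident pages at `oom_score_adj = -500` scores below one holding 1,300,000 pages at 0, because the adjustment subtracts 2,000,000 pages' worth of badness.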

Transparent Huge Pages (THP)

# THP modes: never / madvise / always
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

# 'always' carries a latency tax: a page fault may compact memory synchronously to
# build a 2M page, while khugepaged scans VMAs in the background and promotes 4K runs to 2M.
# That compaction can stall the process for tens to hundreds of ms.
# Most database vendors recommend 'never' or 'madvise'.

# Disable for the boot
GRUB_CMDLINE_LINUX="transparent_hugepage=never"

# Or runtime
$ echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Use explicit hugepages instead: preallocated at reservation time, no defrag on the fault path
$ echo 1024 > /proc/sys/vm/nr_hugepages         # 1024 * 2 MiB = 2 GiB
$ mount -t hugetlbfs none /mnt/huge
# Postgres: huge_pages = on; uses MAP_HUGETLB
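
Two small helpers make the settings above scriptable: one extracts the active mode from a bracketed sysfs choice file, the other computes how many 2 MiB pages to reserve for a given size. This is a sketch; the function names are hypothetical, and the 2 MiB default assumes the x86_64 huge page size quoted earlier.

```python
import re


def active_choice(text: str) -> str:
    """Return the bracketed entry from a line like '[always] madvise never'."""
    match = re.search(r"\[([^\]]+)\]", text)
    if match is None:
        raise ValueError(f"no bracketed choice in {text!r}")
    return match.group(1)


def hugepages_needed(bytes_wanted: int, hugepage_size: int = 2 * 1024**2) -> int:
    """Number of huge pages to reserve, rounding up to a whole page."""
    return -(-bytes_wanted // hugepage_size)


if __name__ == "__main__":
    with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
        print("THP mode:", active_choice(f.read()))
```

`hugepages_needed` gives the value to write to nr_hugepages; a real deployment would add headroom for whatever else the application maps with huge pages.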

cgroup v2 Memory Controller

# The controller files in a cgroup directory
/sys/fs/cgroup/myapp.slice/
  ├── memory.current        # current RSS+cache+kernel
  ├── memory.high           # soft throttle (tasks slowed and put under reclaim when crossed)
  ├── memory.max            # hard limit (reclaim first; OOM kill if usage can't be reduced)
  ├── memory.low            # best-effort protection (reclaim avoids this much)
  ├── memory.min            # hard protection (never reclaimed while usage is below it)
  ├── memory.swap.current
  ├── memory.swap.max
  ├── memory.events         # counters: low, high, max, oom, oom_kill
  ├── memory.pressure       # PSI: stall time as fraction of wallclock
  └── memory.stat           # detailed breakdown by type

# Set limits via systemd
[Service]
MemoryHigh=8G
MemoryMax=10G
MemorySwapMax=2G

# Watch real-time pressure
$ cat /sys/fs/cgroup/myapp.slice/memory.pressure
some avg10=12.34 avg60=4.12 avg300=1.23 total=12345678
full avg10=2.10  avg60=0.95 avg300=0.30 total=2345678
#  ^^^^^^^^^^^^^^^^^^^^^^^^^ percent of time the cgroup stalled on memory
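
For alerting, the PSI lines are easy to parse into numbers. A minimal sketch, assuming the `some`/`full` line format shown above; `parse_psi` is a hypothetical helper name.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI output into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        fields = dict(pair.split("=", 1) for pair in pairs)
        result[kind] = {
            # avg* values are percentages; total is cumulative stall time in microseconds.
            key: int(val) if key == "total" else float(val)
            for key, val in fields.items()
        }
    return result


if __name__ == "__main__":
    sample = (
        "some avg10=12.34 avg60=4.12 avg300=1.23 total=12345678\n"
        "full avg10=2.10  avg60=0.95 avg300=0.30 total=2345678\n"
    )
    psi = parse_psi(sample)
    print(psi["some"]["avg10"], psi["full"]["total"])
```

A check such as `psi["full"]["avg10"] > 5.0` is one possible trigger for shedding load before memory.max is reached; the threshold is an arbitrary example.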

Tradeoffs

Strengths
  • Demand paging means processes consume memory only as they touch it
  • Page cache transparently buffers all file I/O
  • cgroup v2 provides predictable per-tenant isolation in containers
  • PSI metrics give you 'stall time' instead of guessing from raw counters
Sharp edges
  • OOM killer scoring can pick the wrong victim under skewed workloads
  • THP causes latency spikes; explicit hugepages need preallocation
  • NUMA-unaware allocations cause cross-socket traffic and higher memory latency
  • Memory leaks in long-running services manifest as gradual swap pressure, not crashes

Frequently Asked Questions

What's the difference between a minor and a major page fault?

A minor page fault happens when the page can be supplied without I/O: it is already in physical RAM (for example, in the page cache because another process mapped the same file), or it is a fresh zero-filled anonymous page. The kernel just updates the page tables and resumes the process. Cost: a few microseconds. A major page fault means the page must be read from disk (swapped out, or never loaded from a memory-mapped file). Cost: hundreds of microseconds to tens of milliseconds depending on storage. Look at fields 10 (minflt) and 12 (majflt) of /proc/<pid>/stat, or run 'ps -o min_flt,maj_flt -p <pid>', to see them per process.

How does the OOM killer choose its victim?

Each process has an oom_score derived from its memory footprint: roughly resident pages plus swap entries plus page-table pages, expressed relative to total memory and then shifted by oom_score_adj (a value from -1000 to 1000). When the system runs out of memory, the kernel walks the task list, picks the highest scorer, and sends it SIGKILL. You can protect a critical process by writing -1000 to /proc/<pid>/oom_score_adj, which exempts it from OOM killing entirely. systemd exposes the same knob as OOMScoreAdjust= in unit files. Caveat: you can't protect everything; if no eligible task remains, a system-wide OOM ends in a kernel panic rather than a kill.

Why do transparent huge pages sometimes hurt performance?

THP gives you 2 MiB pages instead of 4 KiB, which reduces TLB pressure for workloads with huge contiguous mappings (databases, JVMs). The kernel gets there either by allocating a 2M page directly at fault time or by having khugepaged collapse runs of consecutive 4K pages into one. Two costs: (1) latency spikes when the kernel pauses your process to compact memory and form a 2M page; (2) write amplification when COW or NUMA migration splits a 2M page back into 4K pages. For latency-sensitive databases (Postgres, Redis) the recommendation is to disable THP and use explicit hugepages instead.

What is the TLB and why is it small?

The Translation Lookaside Buffer is a hardware cache of recent virtual-to-physical translations. It lives in the CPU's MMU and is consulted on every memory access. It's small (Intel x86: typically 64 entries L1 + ~1500 L2 per core) because it must be very fast: it sits on the critical path of every load and store. With 4 KiB pages, 64 TLB entries cover only 256 KiB of memory; for larger working sets you get TLB misses, and the CPU must walk the 4-level page table from RAM (or its caches), at a penalty of about 60 cycles. With 2 MiB huge pages, the same 64 entries would cover 128 MiB.
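
The coverage figures are simple multiplication, sketched below so other entry counts and page sizes can be plugged in. `tlb_reach` is a hypothetical helper, and it assumes the TLB holds the same number of entries regardless of page size, which real CPUs do not guarantee.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


if __name__ == "__main__":
    for label, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB)):
        print(f"64 entries x {label} pages = {tlb_reach(64, size) // KIB} KiB")
```

The 512x difference in reach is the whole case for huge pages: a working set that thrashes a 256 KiB reach may fit comfortably in 128 MiB.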

How does cgroup v2 memory.high differ from memory.max?

memory.max is a hard limit: when usage reaches it, the kernel first tries to reclaim from the cgroup, and if usage still can't be reduced it invokes the OOM killer inside that cgroup. memory.high is a soft throttle: when usage exceeds it, the kernel slows the cgroup's tasks and puts them under heavy reclaim pressure, without killing anything. Setting memory.high gives you predictable degradation under pressure (slow, not dead), while memory.max gives you the safety guarantee. Common pattern: set memory.high to 90% of capacity, memory.max to 100%.
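
The 90%/100% pattern is easy to derive from a single capacity figure. The sketch below is one way to compute the two values and to describe where a given usage sits relative to them; both function names are hypothetical and the 90% default is just the rule of thumb quoted above.

```python
def suggest_limits(capacity_bytes: int, high_percent: int = 90) -> dict:
    """Derive memory.high and memory.max from one capacity figure."""
    return {
        # Integer arithmetic keeps the result exact for large byte counts.
        "memory.high": capacity_bytes * high_percent // 100,
        "memory.max": capacity_bytes,
    }


def pressure_state(current: int, high: int, maximum: int) -> str:
    """Describe what the kernel does at a given usage level."""
    if current >= maximum:
        return "at hard limit"      # reclaim, then cgroup OOM kill
    if current > high:
        return "throttled"          # slowed and reclaimed, not killed
    return "ok"
```

For a 10 GiB budget, `suggest_limits(10 * 1024**3)` puts memory.high at 9 GiB, leaving a 1 GiB band where the workload is slowed before it can be killed.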

What does 'swap accounting' do?

The memory limit itself covers RAM charged to the cgroup (anonymous memory, page cache, and kernel memory), not pages that have been swapped out. Swap accounting additionally charges swapped-out pages: in v2 they count against memory.swap.max, and in v1 against the combined memory+swap limit memory.memsw.limit_in_bytes. Without it, a task can fill its allowed RAM, get swapped out, and then allocate more, effectively bypassing the limit. Older kernels gated this behind CONFIG_MEMCG_SWAP=y and the swapaccount=1 boot parameter; modern distros enable it by default. Caveat: it adds a small memory overhead for tracking swap entries.
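
One practical consequence of the v1/v2 difference: the v1 memsw limit is memory plus swap combined, while v2's memory.swap.max counts swap alone, so migrating a limit means subtracting. A minimal sketch of that conversion, with a hypothetical function name:

```python
def v2_swap_max(v1_memory_limit: int, v1_memsw_limit: int) -> int:
    """Convert a v1 memory+swap limit into a v2 swap-only limit (bytes)."""
    if v1_memsw_limit < v1_memory_limit:
        raise ValueError("memsw limit cannot be below the memory limit")
    return v1_memsw_limit - v1_memory_limit


if __name__ == "__main__":
    gib = 1024**3
    # A v1 setup with 10 GiB of RAM and 12 GiB of RAM+swap allows 2 GiB of swap.
    print(v2_swap_max(10 * gib, 12 * gib) // gib, "GiB")
```

This matches the systemd example earlier in the document, where MemoryMax=10G is paired with MemorySwapMax=2G.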