Filesystems
VFS, ext4, XFS, Btrfs, ZFS, tmpfs, FUSE — One Interface, Many Engines
Linux supports dozens of filesystems through a single abstraction: the Virtual File
System layer. Userspace calls read(), write(), stat();
VFS resolves the path through cached dentries to an inode, looks up the inode's
file_operations table, and dispatches to whichever filesystem implements it. The
same syscall reaches ext4, a remote NFS share, a FUSE-mounted S3 bucket, or
/proc — and userspace is none the wiser.
The choice of filesystem shapes everything below that interface: how data is laid out, how crashes are recovered, how metadata scales, and how snapshots work. Modern Linux offers a spectrum: the workhorse simplicity of ext4, the copy-on-write designs of Btrfs and ZFS, the in-memory speed of tmpfs, and the flexibility of FUSE, which moves the filesystem implementation into userspace.
The Stack
From a syscall to physical bytes.
VFS: The Interface Every Filesystem Implements
The Virtual File System (VFS) is not a filesystem; it is a kernel subsystem that
defines the contract between syscall handlers and concrete filesystems. A
filesystem registers a struct file_system_type with register_filesystem(). At
mount time it fills in a struct super_block and attaches its super_operations,
inode_operations, and file_operations function tables to the objects it creates.
Generic kernel code never calls filesystem-specific functions by name; it always
goes through these tables.
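The table-dispatch idea can be sketched in plain userspace C. This is only an illustration of the pattern, not kernel code: the toy_ types, the two pretend filesystems (diskfs and zerofs), and toy_vfs_read() are all hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Toy stand-ins for the kernel types; every name below is hypothetical. */
struct toy_file;

struct toy_file_operations {
    ssize_t (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_operations *f_op; /* chosen by the "filesystem" */
    const char *backing;                    /* pretend on-disk content */
    size_t pos;                             /* current file position */
};

/* "diskfs": returns bytes from its backing store and advances the position. */
static ssize_t diskfs_read(struct toy_file *f, char *buf, size_t len)
{
    size_t size = strlen(f->backing);
    size_t left = f->pos < size ? size - f->pos : 0;
    size_t n = len < left ? len : left;

    memcpy(buf, f->backing + f->pos, n);
    f->pos += n;
    return (ssize_t)n;
}

/* "zerofs": synthesizes data on the fly instead of reading stored bytes. */
static ssize_t zerofs_read(struct toy_file *f, char *buf, size_t len)
{
    memset(buf, 0, len);
    f->pos += len;
    return (ssize_t)len;
}

static const struct toy_file_operations diskfs_fops = { .read = diskfs_read };
static const struct toy_file_operations zerofs_fops = { .read = zerofs_read };

/* The generic layer: it only ever sees the table, never the implementation. */
static ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL && f->f_op != NULL);
    if (f->f_op->read == NULL)
        return -1; /* operation not supported by this "filesystem" */
    return f->f_op->read(f, buf, len);
}
```

A caller would set f_op to &diskfs_fops or &zerofs_fops when "opening" a file and then call toy_vfs_read() without knowing which one it got, which is the property the real VFS gives userspace.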
The Four Core VFS Objects
VFS defines four object types, each with a lifecycle managed by reference counting:
struct super_block { /* One per mounted filesystem instance */
struct list_head s_list; /* global super_blocks list */
dev_t s_dev; /* device identifier */
unsigned long s_blocksize;
unsigned long s_blocksize_bits;
loff_t s_maxbytes; /* max file size */
struct file_system_type *s_type; /* filesystem type (ext4, xfs, ...) */
const struct super_operations *s_op; /* alloc_inode, write_inode, ... */
const struct dquot_operations *dq_op;
const struct export_operations *s_export_op;
struct dentry *s_root; /* root dentry of this mount */
struct rw_semaphore s_umount;
/* Filesystem-private state, set up by the filesystem's fill_super callback.
* This is an opaque pointer, not a union: ext4 stores a struct ext4_sb_info *,
* XFS a struct xfs_mount *, Btrfs a struct btrfs_fs_info *. */
void *s_fs_info;
...
};
struct inode { /* One per file (on disk inode + VFS cache entry) */
umode_t i_mode; /* type + permissions */
kuid_t i_uid;
kgid_t i_gid;
loff_t i_size;
struct timespec64 i_atime; /* last access */
struct timespec64 i_mtime; /* last modification */
struct timespec64 i_ctime; /* last metadata change */
/* creation time is not a VFS inode field: ext4 keeps i_crtime in its private
* struct ext4_inode_info and reports it through statx() */
unsigned long i_ino; /* inode number, unique per FS instance */
unsigned int i_nlink; /* hard link count */
const struct inode_operations *i_op;
const struct file_operations *i_fop; /* default f_op for opens; set when the filesystem initializes the inode */
struct address_space *i_mapping; /* page cache root */
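struct super_block *i_sb; /* filesystem instance this inode belongs to */
struct hlist_node i_hash; /* inode_hashtable chain */
struct list_head i_lru; /* per-superblock LRU of unused inodes */
atomic_t i_count; /* reference count */
unsigned long i_state; /* I_NEW, I_DIRTY_*, I_FREEING, ... */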
};
struct dentry { /* One per path component ("etc", "hosts") */
unsigned int d_flags;
seqcount_t d_seq;
struct hlist_node d_hash; /* dentry_hashtable chain */
struct dentry *d_parent; /* parent directory */
struct qstr d_name; /* the name bytes + hash */
struct inode *d_inode; /* inode this dentry points to */
unsigned char d_iname[DNAME_INLINE_LEN]; /* short names stored inline */
struct list_head d_lru; /* LRU list for eviction */
struct list_head d_child; /* link in the parent's list of children */
struct list_head d_subdirs; /* this dentry's children (files and directories) */
struct lockref d_lockref; /* per-dentry spinlock + reference count */
void *d_fsdata; /* filesystem-private data */
const struct dentry_operations *d_op;
struct super_block *d_sb; /* super_block of the backing FS */
unsigned long d_time; /* used by d_revalidate */
};
struct file { /* One per open file description (shared by dup()ed and inherited descriptors) */
struct path f_path; /* path + mount */
struct inode *f_inode;
const struct file_operations *f_op; /* read, write, mmap, ... */
fmode_t f_mode; /* FREAD, FWRITE, ... */
loff_t f_pos; /* file position (shared by every descriptor that refers to this struct file) */
unsigned int f_flags; /* O_RDONLY, O_NONBLOCK, O_DIRECT, ... */
atomic_long_t f_count; /* reference count */
struct fown_struct f_owner;
void *private_data; /* per-open state owned by the filesystem or driver */
struct address_space *f_mapping;
unsigned long f_version;
void *f_security;
};
/* Reference counting: inodes carry an atomic i_count, dentries a lockref
* (spinlock + count), and struct file an atomic f_count.
* - Take a reference: igrab() / ihold() for inodes, dget() for dentries
* - Drop a reference: iput() / dput(); the last iput() reaches iput_final(),
*   and freed dentries and inodes are released via call_rcu()
*
* Unused inodes (i_count == 0) sit on a per-superblock LRU. The superblock
* shrinker, driven by memory reclaim (kswapd or direct reclaim), walks that
* list to pick eviction victims.
*/
How Path Resolution Works
The kernel's path_lookupat() function walks a path string component by
component. It starts at the root dentry (or the current directory's dentry) and
for each component:
/* Simplified path resolution flow (path_lookupat / link_path_walk):
*
* 1. Start at current->fs->root for an absolute path, or at current->fs->pwd
*    for a relative one.
*
* 2. For each path component (e.g., "/etc/hosts" → "etc" then "hosts"):
*
*    a. Hash the component name into a qstr (name + length + hash). A filesystem
*       may override the hash through dentry->d_op->d_hash().
*
*    b. Look up in the dcache:
*       - The dcache is one global hash table (dentry_hashtable), sized at boot
*         in proportion to system memory.
*       - Hash key = (parent dentry, component name hash).
*       - If found → use it. The fast path (RCU-walk) runs under rcu_read_lock()
*         without taking locks or reference counts.
*       - If not found → allocate a dentry and call the parent directory's
*         inode->i_op->lookup() → ext4_lookup(), xfs_vn_lookup(), ...
*         which reads the on-disk directory entry.
*
*    c. The filesystem attaches the inode to the new dentry (usually with
*       d_splice_alias()); a name that does not exist leaves a negative dentry.
*
*    d. If it's a symlink and the lookup flags allow following, resolve the symlink.
*       (The unwalked remainder of the path is kept in nameidata and resumed afterwards.)
*
*    e. Move to the child dentry; repeat for the next component.
*
* 3. The walk ends with a struct path (vfsmount + dentry) for the final component,
*    with references held. For open(), struct file->f_path records that pair.
*
* Important: the VFS dentry cache is global across all mounts and all CPUs.
* rcu_read_lock is used for lock-free lookup in the fast path. When dentries
* are freed, the memory is released via call_rcu() so it isn't reused until
* all in-flight RCU readers finish.
*/
/* The dentry hash table (dcache) */
static struct hlist_bl_head *dentry_hashtable; /* one global table, allocated at boot */
static unsigned int d_hash_shift; /* bucket index = hash >> d_hash_shift */
/* Hash function: parent dentry pointer + qstr hash → bucket index.
* A filesystem can supply its own d_op->d_hash(), which runs before the table
* lookup (e.g., case-insensitive filesystems hash the case-folded name).
*/
/* dcache stats (cat /proc/sys/fs/dentry-state):
* nr_dentry - total dentries allocated
* nr_unused - dentries on the LRU (no active references)
* age_limit - age in seconds after which unused dentries may be reclaimed
* want_pages - pages requested by the system when the shrinker runs
* nr_negative - unused negative dentries (newer kernels)
*/
The VFS Function Tables
/* Each filesystem defines these as static tables and attaches them at mount and inode-setup time: */
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*dirty_inode)(struct inode *, int flags);
int (*write_inode)(struct inode *, struct writeback_control *wbc);
int (*drop_inode)(struct inode *); /* called on last iput; nonzero means evict now */
void (*evict_inode)(struct inode *); /* final cleanup */
void (*put_super)(struct super_block *); /* umount */
int (*sync_fs)(struct super_block *, int wait);
int (*freeze_super)(struct super_block *);
int (*thaw_super)(struct super_block *);
...
};
struct inode_operations { /* signatures vary by kernel version; recent kernels add a struct mnt_idmap * argument to most */
int (*create)(struct inode *, struct dentry *, umode_t, bool);
struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int flags);
int (*link)(struct dentry *, struct inode *, struct dentry *);
int (*unlink)(struct inode *, struct dentry *);
int (*mkdir)(struct inode *, struct dentry *, umode_t);
int (*rmdir)(struct inode *, struct dentry *);
int (*mknod)(struct inode *, struct dentry *, umode_t, dev_t);
int (*rename)(struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
int (*setattr)(struct dentry *, struct iattr *);
int (*getattr)(const struct path *, struct kstat *, u32, unsigned int);
ssize_t (*listxattr)(struct dentry *, char *, size_t);
/* getxattr/setxattr/removexattr left this table in Linux 4.9; xattrs go through the sb->s_xattr handlers */
int (*set_acl)(struct inode *, struct posix_acl *, int);
...
};
struct file_operations { /* reached via inode->i_fop and copied to file->f_op at open */
loff_t (*llseek)(struct file *, loff_t, int);
ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
int (*iterate_shared)(struct file *, struct dir_context *); /* directory listing; replaced readdir and iterate */
int (*fsync)(struct file *, loff_t, loff_t, int); /* (file, start, end, datasync) */
int (*open)(struct inode *, struct file *);
int (*flush)(struct file *, fl_owner_t);
int (*release)(struct inode *, struct file *); /* last reference to the struct file dropped */
unsigned long (*get_unmapped_area)(struct file *, unsigned long,
unsigned long, unsigned long, unsigned long);
int (*mmap)(struct file *, struct vm_area_struct *);
long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
long (*compat_ioctl)(struct file *, unsigned int, unsigned long);
...
};
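/* The read() syscall handler (ksys_read) does:
* fdget_pos(fd) → struct file *f
* vfs_read(f, buf, count, &pos) → f->f_op->read(), or f->f_op->read_iter(iocb, iov_iter)
*   when the filesystem implements only the iterator variant
* f->f_pos = pos; fdput_pos(f)
*/
inode_operations vs i_fop: The Distinction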
i_op defines operations on the inode as a filesystem object: creating files in a
directory, removing links, renaming, setting attributes, looking up a name within a
directory. i_fop defines operations on an open file handle: reading,
writing, seeking, memory-mapping, fsyncing. The confusion arises because both pointers
live in struct inode; the filesystem sets i_fop when it initializes the inode, and
open() copies it into file->f_op. For regular files, i_fop usually points to
operations built on the generic page-cache helpers. For directories, i_op carries
mkdir, rmdir, and lookup, while i_fop carries directory iteration.
The Page Cache
The page cache is the kernel's main cache for file data. It bridges files (named byte ranges) and physical RAM (organized as pages). Every buffered read and write goes through it. Each file's pages are indexed by file offset in a radix tree (an XArray since Linux 4.20), so finding the page for any offset takes a handful of pointer hops, bounded by the tree height, regardless of file size.
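To make the lookup cost concrete, the sketch below estimates how many tree levels are needed to index a file of a given size. It assumes 4 KiB pages and 64 slots per node; the TOY_ constants and both function names are hypothetical, and the real tree has extra optimizations (for example, a single page at index 0 needs no node at all).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u /* assumed 4 KiB pages */
#define TOY_MAP_SHIFT   6u /* assumed 64 slots per tree node */

/* Number of page-sized slots needed to cover file_size bytes. */
static uint64_t pages_for_size(uint64_t file_size)
{
    return (file_size + (1ull << TOY_PAGE_SHIFT) - 1) >> TOY_PAGE_SHIFT;
}

/* Levels needed so that the largest page index (nr_pages - 1) is addressable. */
static unsigned int tree_levels(uint64_t nr_pages)
{
    uint64_t max_index;
    unsigned int levels = 1;

    assert(nr_pages > 0);
    max_index = nr_pages - 1;
    /* Each extra level contributes TOY_MAP_SHIFT more index bits. */
    while (levels * TOY_MAP_SHIFT < 64 &&
           (max_index >> (levels * TOY_MAP_SHIFT)) != 0)
        levels++;
    return levels;
}
```

For a 1 GiB file, pages_for_size() gives 262144 pages (18 index bits), and tree_levels() gives 3, so a lookup touches at most three nodes.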
The Radix Tree
/* struct address_space sits inside struct inode:
* inode->i_mapping → struct address_space
* address_space->page_tree → struct radix_tree_root
*   (renamed i_pages and converted to an XArray by Linux 4.20)
*
* The tree maps page index (file offset >> PAGE_SHIFT) → struct page *
* Each entry represents one page-aligned, PAGE_SIZE chunk of the file.
*
* For a 1 GiB file: 1 GiB / 4096 = 262144 entries (256Ki), i.e. 18 bits of index.
* Each node has 64 slots (RADIX_TREE_MAP_SHIFT = 6), so the tree needs
* ceil(18 / 6) = 3 levels. Nodes are allocated only for populated index
* ranges, so sparse files stay cheap.
*/
struct address_space {
struct inode *host; /* owner inode (or NULL for swap) */
struct radix_tree_root page_tree; /* root of the radix tree (i_pages in current kernels) */
spinlock_t tree_lock; /* protects page_tree (the XArray's own xa_lock in current kernels) */
atomic_t i_mmap_writable; /* count of shared writable mappings */
struct rb_root_cached i_mmap; /* VMAs mapping this file */
struct rw_semaphore i_mmap_rwsem; /* protects i_mmap */
unsigned long nrpages; /* total cached pages */
const struct address_space_operations *a_ops;
...
};
struct address_space_operations { /* the FS implements this; names follow the folio-based API (5.19+) */
int (*writepage)(struct page *, struct writeback_control *); /* being phased out in favor of writepages */
int (*read_folio)(struct file *, struct folio *); /* replaced readpage in 5.19 */
int (*writepages)(struct address_space *,
struct writeback_control *);
void (*readahead)(struct readahead_control *);
int (*write_begin)(struct file *, struct address_space *,
loff_t, unsigned, unsigned,
struct page **, void **);
int (*write_end)(struct file *, struct address_space *,
loff_t, unsigned, unsigned,
struct page *, void *);
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio)(struct folio *, size_t, size_t);
bool (*release_folio)(struct folio *, gfp_t); /* replaced releasepage in 5.19 */
void (*free_folio)(struct folio *);
...
};
/* The radix tree is a sparse-indexed tree:
* - radix_tree_insert(page_tree, index, page) → insert page at page index
* - radix_tree_lookup(page_tree, index) → find page at page index
* - radix_tree_delete(page_tree, index) → remove page
* (XArray equivalents: xa_store(), xa_load(), xa_erase())
*
* Lookups are lockless under RCU; mutations hold tree_lock.
* Large folios (including THP) are stored as multi-index entries, so one entry
* covers many consecutive page indices.
*/
/* Finding a page for a byte offset:
* folio = filemap_get_folio(mapping, index); // index = offset >> PAGE_SHIFT
* if absent → allocate a folio, add it to the mapping, and call a_ops->read_folio()
*
* Dirtying a page:
* lock_page(page);
* set_page_dirty(page); // sets PG_dirty, tags the tree entry dirty, marks the inode dirty
* unlock_page(page);
* // writeback happens later via writeback workers
*/
Writeback: How Dirty Pages Reach Disk
Dirty pages accumulate in the page cache and are written back by flusher workers,
one set per BDI (Backing Device Info). Writeback is driven by several conditions:
periodic timers, the amount of dirty memory crossing the vm.dirty_* thresholds,
memory reclaim (pages can't stay dirty forever), and the sync/fsync syscalls.
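The dirty-memory thresholds that gate writeback come from the vm.dirty_background_ratio and vm.dirty_ratio sysctls (defaults 10 and 20). The sketch below shows the arithmetic only. It takes the amount of "dirtyable" memory as an input rather than deriving it, and the struct and function names are hypothetical; the halving rule for a misconfigured background ratio mirrors what the kernel does, but treat that detail as an assumption to verify against your kernel version.

```c
#include <assert.h>
#include <stdint.h>

struct dirty_limits {
    uint64_t background_pages; /* background writeback starts above this */
    uint64_t throttle_pages;   /* writers are throttled above this */
};

/*
 * dirtyable_pages: memory the kernel is willing to fill with dirty data
 * (roughly free memory plus page cache); supplied by the caller here.
 */
static struct dirty_limits dirty_limits_from_ratio(uint64_t dirtyable_pages,
                                                   unsigned int background_ratio,
                                                   unsigned int dirty_ratio)
{
    struct dirty_limits l;

    assert(background_ratio <= 100 && dirty_ratio <= 100);
    l.background_pages = dirtyable_pages * background_ratio / 100;
    l.throttle_pages = dirtyable_pages * dirty_ratio / 100;

    /* Keep the background limit below the throttle limit. */
    if (l.background_pages >= l.throttle_pages)
        l.background_pages = l.throttle_pages / 2;
    return l;
}
```

With 4,000,000 dirtyable pages (about 15 GiB at 4 KiB per page) and the default ratios, background writeback starts at 400,000 dirty pages and writers are throttled at 800,000.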
/* Writeback triggers come from:
* 1. Flusher workers (per-BDI work items running wb_workfn(); the old pdflush
*    threads were retired in 2.6.32)
*    - periodic: woken every dirty_writeback_centisecs (default 500 = 5 seconds)
*    - kupdate: writes pages dirty for longer than dirty_expire_centisecs (default 3000 = 30 sec)
*
* 2. Dirty thresholds: background writeback starts once dirty memory exceeds
*    dirty_background_ratio; writers are throttled in balance_dirty_pages()
*    once it exceeds dirty_ratio. Memory reclaim can also kick the flushers
*    (wakeup_flusher_threads()).
*
* 3. Sync syscalls: sync(2), fsync(2), fdatasync(2)
*    - sync: for every superblock, write back all dirty inodes (sync_inodes_sb()),
*      then call sb->s_op->sync_fs()
*    - fsync: write and wait on the inode's dirty pages, then commit the journal
*
* 4. umount: sync_filesystem() writes remaining dirty data before the super_block is released
*/
/* The BDI (Backing Device Info) represents one backing device's writeback state.
* Its struct bdi_writeback holds the dirty-inode lists and the queued writeback work.
* wb_workfn() is the worker function.
*/
struct bdi_writeback {
struct backing_dev_info *bdi; /* owning device */
unsigned long last_old_flush; /* last periodic (kupdate) flush, in jiffies */
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* inodes queued for writeback */
struct list_head b_more_io; /* inodes parked for another writeback pass */
struct delayed_work dwork; /* the work item that runs wb_workfn() */
...
};
/* writepages callback (address_space_operations):
* The filesystem walks the mapping's pages that are tagged dirty in the
* page-cache tree (PAGECACHE_TAG_DIRTY), maps each range to disk blocks,
* and submits bios.
* ext4: ext4_writepages() also allocates blocks for delayed-allocation ranges here.
* XFS: xfs_vm_writepages() goes through the iomap writeback helpers.
* Btrfs: allocates new copy-on-write extents and checksums the data before submitting.
* Metadata (journal blocks, btree nodes) is written by each filesystem's own machinery.
*
* The writeback_control tells the filesystem how much to write:
* wbc->nr_to_write - pages requested to write
* wbc->sync_mode - WB_SYNC_NONE (background) vs WB_SYNC_ALL (fsync)
* wbc->range_start/end - byte range to write (for partial fsync)
*/
/* fsync() specifically:
* 1. vfs_fsync() → file->f_op->fsync(file, 0, LLONG_MAX, 0) → ext4_sync_file() / xfs_file_fsync()
* 2. The filesystem writes and waits on all dirty data pages for this inode
*    (file_write_and_wait_range())
* 3. The filesystem commits the journal transaction (or forces the log) and,
*    where needed, issues a device cache flush
* 4. Only then does fsync() return to userspace
*
* This is why ext4/XFS fsync is slower than tmpfs: disk I/O is mandatory.
*/
Buffered vs Direct I/O
/* Buffered I/O (default):
* read(fd, buf, 4096) →
* vfs_read() → file->f_op->read_iter()
* generic_file_read_iter() → filemap_read()
* - look up the page in the page cache
* - on a miss: page_cache_sync_readahead() → a_ops->readahead() / read_folio() → read from disk into the page
* - copy_page_to_iter() → copy from struct page to userspace buf
*
* write(fd, buf, 4096) →
* vfs_write() → file->f_op->write_iter()
* generic_file_write_iter() → generic_perform_write()
* - a_ops->write_begin(): find or create the page in the cache
* - copy_page_from_iter_atomic(): copy from userspace into the page
* - a_ops->write_end(): mark the page dirty; writeback picks it up later
* - file_update_time() updates mtime/ctime and marks the inode dirty
*
* Direct I/O (O_DIRECT):
* - bypasses the page cache for data
* - bios are built on the pinned user pages and submitted directly to the block layer
* - buffer address, file offset, and length must be aligned, typically to the
*   device's logical block size; misaligned requests usually fail with EINVAL
*
* io_uring can drive both buffered and O_DIRECT I/O on regular files; no special
* mount option is involved. Registered (fixed) buffers are pinned once at
* registration, which removes the per-I/O page-pinning cost for O_DIRECT.
*/
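The alignment rule for O_DIRECT is easy to get wrong, so here is a small userspace sketch of a pre-flight check and a matching allocator. Both helper names are hypothetical, and the alignment value is an input: 4096 is a common safe choice, while the exact requirement depends on the device and filesystem (on Linux 6.1 and later it can be queried with statx() and STATX_DIOALIGN).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Returns 1 if buffer address, file offset and length are all multiples of align. */
static int dio_request_aligned(const void *buf, off_t offset, size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* must be a power of two */
    return ((uintptr_t)buf & (align - 1)) == 0 &&
           ((uint64_t)offset & (align - 1)) == 0 &&
           (len & (align - 1)) == 0;
}

/* Allocates a buffer whose address is a multiple of align; release with free(). */
static void *dio_alloc(size_t len, size_t align)
{
    void *p = NULL;

    if (posix_memalign(&p, align, len) != 0)
        return NULL;
    return p;
}
```

A caller would allocate with dio_alloc(len, 4096), confirm dio_request_aligned() for the intended offset and length, and only then issue pread() or pwrite() on a descriptor opened with O_DIRECT.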
/* Page dirtying rate control (sysctl):
* dirty_background_bytes / dirty_background_ratio - background writeback starts
* dirty_bytes / dirty_ratio - writers are throttled (blocking writeback)
* dirty_expire_centisecs - age at which dirty pages count as old
* dirty_writeback_centisecs - flusher wake interval
*
* These vm.dirty_* tunables live in /proc/sys/vm. A nonzero *_bytes setting
* overrides its *_ratio counterpart.
* On a machine with lots of RAM, the default ratios allow gigabytes of dirty data
* to build up, which means long writeback stalls and more data lost in a crash.
* Lowering dirty_ratio to 5-10% is common.
*/
Inode Cache (icache)
The inode cache is the VFS layer's in-memory cache of disk inodes. Every file currently in use has a struct inode in memory, regardless of whether its data pages are cached. The icache is a hash table of inode structures indexed by (super_block, inode number). Unused inodes (i_count == 0) sit on a per-superblock LRU that the shrinker trims when memory is low. This is distinct from the page cache: an inode can be in the icache without any of its file data in the page cache.
/* The inode hash table (inode_hashtable):
* static struct hlist_head *inode_hashtable; // hashed by (superblock, ino)
*
* bucket = inode_hashtable + hash(sb, ino)
*/
/* Allocating an inode:
* - iget5_locked(sb, hashval, test, set, data) → find or create
* - if found: i_count++
* - if not found: alloc_inode(sb) → sb->s_op->alloc_inode()
*   - ext4: ext4_alloc_inode() → allocates from ext4_inode_cachep
*   - xfs: runs its own lookup in xfs_iget() and allocates struct xfs_inode from a dedicated slab cache
*   - btrfs: btrfs_alloc_inode() → allocates from btrfs_inode_cachep
* - fill with defaults, add to hash table, return with I_NEW set
* - the filesystem reads the on-disk inode, then unlock_new_inode() clears I_NEW
*
* Evicting an inode (iput_final() → evict()):
* - write_inode() first if the inode is dirty and still linked
* - sb->s_op->evict_inode(): truncate_inode_pages_final(&inode->i_data) releases all
*   page cache pages; the filesystem frees on-disk blocks here when i_nlink == 0
* - remove_inode_hash() → unhash from inode_hashtable
* - destroy_inode() → sb->s_op->destroy_inode() / free_inode() → kmem_cache_free()
*/
/* Inode timestamps and their semantics:
* i_atime - last access; with the default relatime option it is updated only when
*           older than mtime/ctime or more than a day old (noatime disables updates)
* i_mtime - updated when file data changes (ctime also changes)
* i_ctime - updated when the inode changes (data, mode, owner, xattr, link count)
* creation time - not in POSIX and not a VFS inode field; ext4 (i_crtime) and btrfs
*           store it and report it via statx() as stx_btime
*
* Every mtime update also updates ctime. This holds on btrfs as well, where a data
* write rewrites the inode item.
*
* Extended attributes (xattrs): stored inline in the inode (ext4 in-inode xattrs)
* or in a separate block. Namespaces:
* - user.*: ordinary user data, persistent on disk (needs xattr support on the mount)
* - trusted.*: visible only to processes with CAP_SYS_ADMIN
* - security.*: SELinux, SMACK, file capabilities
* - system.*: POSIX ACLs (getfacl/setfacl), e.g. system.posix_acl_access and system.posix_acl_default
*/
Dentry Cache (dcache)
The dentry cache is the heart of Linux's path resolution performance. Each path component the kernel resolves is cached as a struct dentry until memory pressure reclaims it. When you open a file a second time, the kernel finds each component with an O(1) hash lookup instead of reading directory blocks again. The dcache holds both positive and negative entries: "no such file" results are cached too, which avoids repeated failed lookups.
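The effect of positive and negative entries can be shown with a toy cache keyed by (parent, name). Everything here is hypothetical and heavily simplified: the table is direct-mapped with one entry per bucket, parents are plain integers, and toy_fs_lookup() stands in for the filesystem's slow path, which pretends that only "etc" exists under parent 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUCKETS 64u
#define TOY_NAME_MAX 32u

struct toy_dentry {
    int used;
    unsigned int parent;     /* id of the parent directory */
    char name[TOY_NAME_MAX]; /* one path component */
    long ino;                /* inode number, or -1 for a negative entry */
};

static struct toy_dentry toy_table[TOY_BUCKETS];
static unsigned int toy_slow_lookups; /* times we fell through to the "filesystem" */

/* FNV-1a over the name, seeded with the parent id: the key is (parent, name). */
static unsigned int toy_hash(unsigned int parent, const char *name)
{
    unsigned int h = 2166136261u ^ parent;

    for (; *name; name++)
        h = (h ^ (unsigned char)*name) * 16777619u;
    return h % TOY_BUCKETS;
}

/* Stand-in for the filesystem's lookup: only "etc" under parent 0 exists. */
static long toy_fs_lookup(unsigned int parent, const char *name)
{
    toy_slow_lookups++;
    return (parent == 0 && strcmp(name, "etc") == 0) ? 42 : -1;
}

/* Returns the inode number, or -1 if the name does not exist. */
static long toy_lookup(unsigned int parent, const char *name)
{
    struct toy_dentry *d = &toy_table[toy_hash(parent, name)];

    if (d->used && d->parent == parent && strcmp(d->name, name) == 0)
        return d->ino; /* cache hit, positive or negative */

    /* Miss: ask the filesystem and cache the answer, even if it is "not found". */
    d->used = 1;
    d->parent = parent;
    strncpy(d->name, name, TOY_NAME_MAX - 1);
    d->name[TOY_NAME_MAX - 1] = '\0';
    d->ino = toy_fs_lookup(parent, name);
    return d->ino;
}
```

Looking up the same name twice increments toy_slow_lookups only once, whether the name exists or not; the second call is answered from the table.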
/* Dcache structure:
*
* Global hash table: dentry_hashtable
* - Bucket = hlist_bl_head (list head with a bit lock)
* - Hash key: parent dentry pointer + qstr hash of the component name
*
* d_hash(hash) → bucket
* d_lookup(parent, name) → dentry * or NULL
* - Fast path: __d_lookup_rcu() walks the bucket under rcu_read_lock(), validating with d_seq
* - Slow path: __d_lookup() takes each candidate's d_lock; d_lookup() retries if a
*   rename raced with it (rename_lock seqlock)
*
* d_instantiate(dentry, inode) → attach an inode to a dentry;
* d_add() / d_splice_alias() also insert the dentry into the hash
*
* When the last reference to a dentry is dropped (dput):
* - a still-hashed dentry stays cached: it goes on the LRU, flagged DCACHE_REFERENCED
* - an unhashed dentry is freed right away
*
* When the dcache must give back memory (under memory pressure):
* - prune_dcache_sb() walks the per-superblock LRU
* - entries flagged DCACHE_REFERENCED get the flag cleared and one more pass
* - shrink_dentry_list() frees the rest
*
* d_drop: unhashes a dentry so lookups no longer find it.
* Called when a cached entry must be invalidated (for example after a failed
* d_revalidate() on a network filesystem). The next lookup misses and goes back
* to the filesystem's ->lookup().
*
* The LRU is separate from the hash table: a dentry can be hashed, on the LRU,
* both, or neither.
*/
/* dcache behavior in path resolution:
*
* open("/etc/hosts"):
* 1. start at the root dentry (always in memory, pinned by the mount)
* 2. lookup "etc" in root's children → likely hit in dcache
* 3. lookup "hosts" in "etc" dentry's children → hit or miss
*    - miss: call ext4_lookup(dir, dentry, flags)
*    - read directory blocks from disk
*    - find "hosts" entry
*    - ext4_iget(sb, ino) → get/create struct inode
*    - d_splice_alias(inode, dentry) → attach the inode and hash the dentry
* 4. dentry for "/etc/hosts" now cached; subsequent lookups are served from the dcache
*
* Hard links: multiple dentries point to the same inode.
* inode->i_nlink counts hard links. Unlink decrements i_nlink.
* When i_nlink == 0, the inode is pending deletion (but may still be open).
*/
/* /proc/sys/vm/drop_caches:
* echo 3 > /proc/sys/vm/drop_caches → drop clean page cache + reclaimable dentries and inodes.
* Dirty pages and in-use objects are not dropped, so run sync first for full effect.
* Use with caution in production: the cold-cache reads that follow all go to disk.
*/
Block Layer: bio, Request Queues, and Schedulers
The block layer is the software bridge between filesystem I/O requests and physical storage devices. Filesystems never talk to disks directly — they submit bios (block I/O descriptors) which the block layer routes through an I/O scheduler, merges with adjacent requests, and dispatches to a device driver. Modern NVMe devices use the multi-queue block layer: per-CPU software queues feeding hardware submission queues, eliminating lock contention at high IOPS.
struct bio and Request Queues
/* struct bio: the fundamental I/O descriptor in the block layer.
* Represents one contiguous range of sectors on a block device, backed by a
* list of (possibly scattered) memory segments.
* Used for both buffered file I/O and raw block device access.
*/
struct bio {
struct bio *bi_next; /* request queue chain */
struct block_device *bi_bdev; /* target device */
unsigned int bi_opf; /* REQ_OP_* in the low bits + REQ_* flags (blk_opf_t since 6.0) */
/* common flags: REQ_SYNC, REQ_META, REQ_NOWAIT, REQ_BACKGROUND,
* REQ_RAHEAD, REQ_FUA, REQ_PREFLUSH, ... */
unsigned short bi_flags; /* BIO_* status flags */
struct bvec_iter bi_iter; /* current sector, remaining bytes, position in bi_io_vec */
atomic_t __bi_remaining; /* completion refcount */
bio_end_io_t *bi_end_io; /* completion callback */
void *bi_private; /* owner's context for bi_end_io */
struct bio_integrity_payload *bi_integrity; /* data integrity payload (if configured) */
unsigned short bi_vcnt; /* number of bio_vecs in use */
unsigned short bi_max_vecs; /* capacity of bi_io_vec */
struct bio_vec *bi_io_vec; /* scatter-gather array */
...
};
struct bvec_iter {
sector_t bi_sector; /* device address in 512-byte sectors */
unsigned int bi_size; /* remaining I/O in bytes */
unsigned int bi_idx; /* current bvec index */
unsigned int bi_bvec_done; /* offset in current bvec */
};
struct bio_vec {
struct page *bv_page;
unsigned int bv_len; /* page segment length */
unsigned int bv_offset; /* offset within page */
};
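/* Submitting a bio:
* submit_bio(bio) →
* submit_bio_noacct(bio) → // generic_make_request() before Linux 5.9
* blk_mq_submit_bio(bio) → // request-based drivers; q = bdev_get_queue(bio->bi_bdev)
* - try to merge into the task's plug list or an already queued request
* - otherwise allocate a request and either insert it into the I/O scheduler
*   or, with scheduler "none", issue it directly to the driver
* (bio-based drivers such as device-mapper or md handle the bio in their own ->submit_bio();
* the legacy single-queue path through blk_queue_bio() was removed in Linux 5.0)
*
* The request queue (struct request_queue) ties this together:
* - elevator: the I/O scheduler, which merges and orders requests
* - plug list: per-task batch of requests built between blk_start_plug() and blk_finish_plug()
* - dispatch: requests handed to the driver through the hardware queues
*/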
/* Block size vs page size:
* PAGE_SIZE = 4096 bytes on x86_64 (the unit of memory management)
* filesystem block size = the filesystem's allocation unit (typically 4096)
* logical block size = smallest unit the device can address (512 or 4096)
* physical block size = smallest unit the device writes atomically (often 4096,
*   even when the logical size is 512)
* bio->bi_iter.bi_sector is always in 512-byte sector units for the block layer.
* Conversion: sector = byte_offset / 512 (byte_offset >> 9).
*/
I/O Schedulers
I/O schedulers (also called elevators) merge adjacent bios into requests and
order them to minimize disk seeks. The scheduler lives between the filesystem and
the driver. On fast NVMe devices with sequential access patterns, the scheduler
overhead can exceed benefit — none is often optimal. On spinning disks
with random workloads, mq-deadline or bfq helps significantly.
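The adjacency test behind merging can be sketched in a few lines. This is a simplified model, not the kernel's merge logic: real merges also check the operation type, size limits, and segment counts. The toy_ names and the struct layout are hypothetical; the 512-byte sector unit is the one the block layer uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SECTOR_SHIFT 9u /* the block layer counts in 512-byte sectors */

struct toy_req {
    uint64_t sector;     /* first sector */
    uint32_t nr_sectors; /* length in sectors */
};

enum toy_merge { TOY_NO_MERGE, TOY_BACK_MERGE, TOY_FRONT_MERGE };

/* Convert a byte offset on the device into a 512-byte sector number. */
static uint64_t bytes_to_sector(uint64_t byte_offset)
{
    return byte_offset >> TOY_SECTOR_SHIFT;
}

/* Can the new I/O be glued onto the end or the start of an existing request? */
static enum toy_merge toy_try_merge(struct toy_req *rq, const struct toy_req *io)
{
    assert(rq != NULL && io != NULL);
    if (rq->sector + rq->nr_sectors == io->sector) {
        rq->nr_sectors += io->nr_sectors; /* request grows at the tail */
        return TOY_BACK_MERGE;
    }
    if (io->sector + io->nr_sectors == rq->sector) {
        rq->sector = io->sector;          /* request grows at the head */
        rq->nr_sectors += io->nr_sectors;
        return TOY_FRONT_MERGE;
    }
    return TOY_NO_MERGE;
}
```

A 4 KiB write is 8 sectors, so a request covering sectors 100-107 back-merges with a new 4 KiB write starting at sector 108 and becomes one 16-sector (8 KiB) request.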
/* Current schedulers (source in block/ of the kernel tree):
*
* none - No scheduler. Requests go to the hardware queues in submission order.
* Best for: fast NVMe, pure sequential throughput, io_uring async.
* sysfs: echo none > /sys/block/nvme0n1/queue/scheduler
*
* mq-deadline - Keeps requests sorted by sector, plus per-direction FIFO lists with
* expiry times (reads 500 ms, writes 5 s by default). Dispatches in sector order
* in batches, but serves a request whose deadline has expired first.
* Good for latency-sensitive mixed workloads.
* sysfs: echo mq-deadline > /sys/block/nvme0n1/queue/scheduler
*
* bfq - Budget Fair Queueing. Groups I/O by process (and cgroup) and gives each
* queue a budget of service. Good for interactive/desktop workloads where one
* process shouldn't starve others. Higher per-request CPU cost than mq-deadline.
* sysfs: echo bfq > /sys/block/nvme0n1/queue/scheduler
*
* kyber - Token-based. Keeps latency low by tracking queue depth vs target.
* Good for fast NVMe with latency-sensitive workloads. Tunable latency targets.
*
* cfq - Completely Fair Queueing. Removed in Linux 5.0 together with the legacy
* single-queue block layer. Was the default for years; its role is now covered
* by mq-deadline, bfq, and kyber.
*/
/* Scheduler tunable parameters (under /sys/block/<dev>/queue/iosched/):
* read_lat_nsec - kyber: target read latency (default 2 ms)
* write_lat_nsec - kyber: target write latency (default 10 ms)
* read_expire / write_expire - mq-deadline: request deadlines in ms (defaults 500 / 5000)
* writes_starved - mq-deadline: read batches allowed before a waiting write must be served
* fifo_batch - mq-deadline: requests per dispatch batch (default 16)
* low_latency - bfq: favor interactive and soft real-time queues (default 1)
*/
/* How elevator merge works:
* bio submitted → blk_mq_sched_bio_merge() looks for a mergeable request
* Check if the bio is adjacent to an existing request (front or back)
* Back merge: the bio starts at the sector where the request ends
* Front merge: the bio ends at the sector where the request starts
* Merge saves I/O overhead: one request instead of two
*
* Example: Filesystem writes 4K at sector 100 (sectors 100-107), then 4K at sector 108 (adjacent).
* Instead of two 4K requests, the elevator merges them into one 8K request → fewer driver round-trips.
*/
/* /sys/block/nvme0n1/queue/ parameters:
* nr_requests - requests that can be allocated per queue (default depends on device and scheduler)
* read_ahead_kb - readahead window (default 128 KB)
* max_sectors_kb - max single request size; tunable up to max_hw_sectors_kb
* max_hw_sectors_kb - hardware max
* rotational - 0 for NVMe/SSD, 1 for spinning disks
* scheduler - available schedulers with the active one in brackets, e.g. [none] mq-deadline kyber bfq
* (SCSI devices additionally expose device/queue_depth)
*/
Multi-Queue Block Layer (blk-mq)
/* Traditional single-queue elevator model:
* request_queue (single lock) ← all CPUs submit bios → contention at high IOPS
*
* Multi-queue model (blk-mq, since Linux 3.13):
* - Per-CPU software queues: one per CPU
* - Per-hardware-queue submission (NVMe: up to 64k hardware queues)
* - Mapping: CPU → software queue → hardware queue (via blk_mq_map_queue())
*
* struct blk_mq_tag_set {
* const struct blk_mq_ops *ops;
* unsigned int nr_hw_queues;
* unsigned int queue_depth;
* ...
* };
*
* blk_mq_submit_bio():
* 1. try to merge the bio into the plug list or an already queued request
* 2. allocate a request (and a tag) on the submitting CPU's software queue (ctx)
* 3. map ctx to its hardware queue context (hctx) via blk_mq_map_queue()
* 4. insert into the scheduler or issue directly; the driver's ->queue_rq()
*    rings the hardware submission queue
*
* Hardware queues map to NVMe submission/completion queue pairs.
* On a 16-core machine with NVMe: 16 software queues → up to 16 hardware queues.
* The NVMe driver asks for one I/O queue per CPU, limited by what the controller grants.
*
* For io_uring with block devices:
* - io_uring reads and writes reach blk-mq through the normal file_operations path
* - Registered buffers use pre-pinned pages → lower per-I/O overhead
* - sqe->opcode = IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED, with sqe->buf_index
*   selecting the registered buffer; registered files are a separate feature (IOSQE_FIXED_FILE)
*/
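/* Block layer stats (/proc/diskstats), per device:
* reads completed, reads merged, sectors read, ms spent reading
* writes completed, writes merged, sectors written, ms spent writing
* (followed by I/Os in flight, ms doing I/O, and weighted ms doing I/O)
*
* iostat -xz 1 shows utilization %, queue depth, IOPS, throughput:
* %util - share of time with at least one request in flight; a device that serves
*   requests in parallel (NVMe, RAID) can show 100% well before it is saturated
* aqu-sz (avgqu-sz in older sysstat) - average queue depth (demand vs device capacity)
* r_await / w_await - average latency from submission to completion, in ms
*/
ext4: Extents, Block Groups, and Journaling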
ext4 is the default filesystem on many Linux distributions. It extends ext3 with extents (contiguous block ranges described by a single record), nanosecond timestamps, delayed allocation (delalloc), and faster fsck. The on-disk layout divides the device into block groups of 128 MiB each (with 4K blocks), with per-group bitmaps and inode tables. The journal (JBD2) is a circular log of transactions, as in ext3, that guarantees metadata consistency after crashes.
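The 128 MiB figure follows from one rule: a group's block bitmap occupies exactly one block, so a group spans as many blocks as that block has bits. The sketch below works through that arithmetic and the size of the inode table. The struct and function names are hypothetical, and features such as bigalloc or flex_bg, which change the picture, are ignored.

```c
#include <assert.h>
#include <stdint.h>

struct toy_group_geometry {
    uint32_t blocks_per_group;   /* bits in one bitmap block */
    uint64_t group_bytes;        /* bytes covered by one group */
    uint32_t inode_table_blocks; /* blocks holding the group's inode table */
};

static struct toy_group_geometry toy_ext4_geometry(uint32_t block_size,
                                                   uint32_t inodes_per_group,
                                                   uint32_t inode_size)
{
    struct toy_group_geometry g;
    uint64_t table_bytes = (uint64_t)inodes_per_group * inode_size;

    /* Block size must be a power of two of at least 1 KiB. */
    assert(block_size >= 1024 && (block_size & (block_size - 1)) == 0);

    g.blocks_per_group = block_size * 8; /* one bitmap bit per block */
    g.group_bytes = (uint64_t)g.blocks_per_group * block_size;
    g.inode_table_blocks = (uint32_t)((table_bytes + block_size - 1) / block_size);
    return g;
}
```

With 4096-byte blocks, 8192 inodes per group, and 256-byte inodes, this gives 32768 blocks per group, 128 MiB per group, and a 512-block (2 MiB) inode table.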
Block Group Layout and Superblock
/* ext4 layout (4K blocks):
* Bytes 0-1023: boot sector padding (not used by ext4)
* Bytes 1024-2047: primary superblock (inside block 0 when blocks are 4K)
* Blocks 1-N: block group descriptors (GDT)
*
* Block group 0: [superblock | GDT | block bitmap | inode bitmap | inode table | data blocks]
* Block group 1: [backup superblock + GDT | block bitmap | inode bitmap | inode table | data blocks]
* Block group N: [block bitmap | inode bitmap | inode table | data blocks]
* (With sparse_super, backups live only in groups 0, 1, and powers of 3, 5, and 7;
* with flex_bg, the bitmaps and inode tables of several groups are packed together.)
*
* Superblock (struct ext4_super_block, 1024 bytes at offset 1024):
* s_inodes_count - total inodes (set at mkfs; changes only on resize)
* s_blocks_count_lo - total blocks (device capacity)
* s_first_data_block - 1 for 1K blocks, 0 for larger block sizes
* s_blocks_per_group - blocks per block group (default 32768 for 128 MiB groups)
* s_inodes_per_group - inodes per block group (mkfs calculates from bytes-per-inode)
* s_default_mount_opts - default mount options (e.g., user_xattr, acl)
* s_log_block_size - block size = 1024 << value: 0=1024, 1=2048, 2=4096
* s_log_cluster_size - allocation cluster size (bigalloc)
* s_magic - 0xEF53 (ext2/3/4 magic)
* s_state - filesystem state: 0x0001 = cleanly unmounted, 0x0002 = errors detected
* s_first_ino - first non-reserved inode (usually 11 for ext4)
* s_inode_size - on-disk inode size in bytes (128 on old ext2/3, 256 by default on ext4)
* s_block_group_nr - block group that holds this copy of the superblock
* s_journal_inum - inode number of the journal file (normally 8)
*
* Block group descriptor (struct ext4_group_desc, 32 bytes, or 64 with the 64bit feature):
* bg_block_bitmap_lo — block number of block allocation bitmap
* bg_inode_bitmap_lo — block number of inode bitmap
* bg_inode_table_lo — first block of inode table
* bg_free_blocks_count_lo
* bg_free_inodes_count_lo
* bg_used_dirs_count_lo
* (64-bit extensions: _hi variants for > 2^32 blocks)
*
* On 128 MiB block groups with 4K blocks:
* 32768 blocks × 4096 bytes = 128 MiB per group
* Inode table: s_inodes_per_group × s_inode_size bytes
* 8192 inodes × 256 bytes = 2 MiB per group for inode table
* Block bitmap: 1 block = 32768 bits = 32768 blocks addressable ✓
*/
/* Common mkfs.ext4 options (defaults come from /etc/mke2fs.conf):
* -b 4096 — 4K block size (the usual default)
* -i 16384 — bytes per inode (default); lower it, e.g. -i 4096, for millions of small files
* -I 256 — 256-byte inodes (room for extra timestamps and inline xattrs)
* -g 32768 — blocks per group
* -G 16 — block groups packed into one flex_bg group (default 16)
* -O ^64bit — disable the 64-bit feature (only for compatibility with old tools)
* -E stride=N,stripe-width=M — RAID alignment, both values in filesystem blocks
*
* # RAID-aligned mkfs for an 8-disk RAID6 with 64 KiB chunks:
* mkfs.ext4 -E stride=16,stripe-width=96 /dev/md0
* stride = chunk_size / block_size = 65536 / 4096 = 16
* stripe-width = stride × data disks = 16 × (8 - 2) = 96
* (RAID6 uses two disks' worth of parity, so only 6 of the 8 disks carry data per stripe)
*/
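The stride and stripe-width calculation is easy to get wrong by hand, so here is a small sketch of it. The names `raid_align` and `ext4_raid_align` are hypothetical helpers, not part of e2fsprogs; the formulas are the ones given above (stride = chunk size / block size, stripe-width = stride × data disks).

```c
#include <assert.h>
#include <stdint.h>

/* Values to pass to mkfs.ext4 -E stride=...,stripe-width=... */
struct raid_align {
    uint32_t stride;       /* filesystem blocks per RAID chunk */
    uint32_t stripe_width; /* filesystem blocks per full data stripe */
};

static struct raid_align ext4_raid_align(uint32_t chunk_bytes,
                                         uint32_t block_bytes,
                                         uint32_t total_disks,
                                         uint32_t parity_disks)
{
    assert(block_bytes != 0 && chunk_bytes % block_bytes == 0);
    assert(total_disks > parity_disks);

    struct raid_align a;
    a.stride = chunk_bytes / block_bytes;
    /* Only the data disks contribute to the stripe width. */
    a.stripe_width = a.stride * (total_disks - parity_disks);
    return a;
}
```

For an 8-disk RAID6 (two parity disks) with 64 KiB chunks and 4 KiB blocks this yields stride 16 and stripe-width 96; with 512 KiB chunks it yields 128 and 768.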
/* The inode table: each inode (struct ext4_inode, 128 or 256 bytes):
* i_mode, i_uid, i_size_lo, i_atime, i_ctime, i_mtime, i_dtime
* i_gid, i_links_count, i_blocks_lo (counted in 512-byte units unless the huge_file flag says otherwise)
* i_flags (EXT4_EXTENTS_FL, EXT4_INLINE_DATA_FL, ...)
* osd1 (OS-dependent field)
* i_block[15] — extent tree root (if EXT4_EXTENTS_FL) or block pointers
* i_extra_isize, i_checksum_hi, i_ctime_extra, i_mtime_extra, i_atime_extra, i_crtime, i_version_hi
*
* With 256-byte inodes (s_inode_size=256), small extended attributes fit inside the inode.
* i_extra_isize (typically 32) covers the extra fields listed above.
* The space after those fields holds in-inode xattrs; larger ones go to a separate xattr block.
*/
Extent Trees and Allocation
/* Extent-based file layout (EXT4_EXTENTS_FL):
*
* Each file's i_block[0..14] (15 * 4 bytes = 60 bytes) holds the extent tree root.
* The root holds a 12-byte header plus up to 4 entries (12 + 4 × 12 = 60 bytes); the tree can be at most 5 levels deep.
* Larger trees allocate separate blocks for the tree nodes.
*
* struct ext4_extent_header (12 bytes, in i_block[0]):
* eh_magic = 0xF30A
* eh_entries — number of valid entries
* eh_max — capacity in this block
* eh_depth — 0 = leaf (entries are extents), >0 = index node (entries point to child blocks); at most 5
* eh_generation — generation of the tree (not used by standard ext4)
*
* struct ext4_extent (12 bytes, leaf node):
* ee_block — first logical block number (file-relative)
* ee_len — number of blocks in this extent (1-32768)
* ee_start_hi — upper 16 bits of physical block number
* ee_start_lo — lower 32 bits of physical block number
*
* struct ext4_extent_idx (12 bytes, internal node):
* ei_block — first logical block this child covers
* ei_leaf_lo — lower 32 bits of child node's block number
* ei_leaf_hi — upper 16 bits of child node's block number
* ei_unused — padding
*
* Maximum file size with 4K blocks:
* 1 extent covers up to 32768 blocks = 128 MiB
* ee_block is 32 bits, so a file can address at most 2^32 logical blocks
* Limit: 2^32 × 4096 bytes = 16 TiB per file
* With 64K blocks: 2^32 × 65536 bytes = 256 TiB per file
*
* Extent example:
* File logical blocks 0-499: extent covering physical block 204800, len 500
* File logical blocks 500-999: extent covering physical block 2097152, len 500
* Sparse files: a hole simply has no extent entry covering its logical blocks
*
* Allocation policy (ext4_mb_new_blocks):
* - Goal: find a block group with enough contiguous free blocks
* - Buddy algorithm within block groups (ext4_mb_load_buddy)
* - Preallocation: mballoc keeps per-inode preallocated ranges for growing files
* and per-CPU locality groups that pack small files together
* - Delayed allocation (delalloc): write() only reserves space;
* the reservation is tracked in memory, not in the on-disk extent tree
* Actual block allocation happens at writeback time
*
* # Inspect extents of a file:
* filefrag -v /var/log/messages
* # Prints one row per extent: logical block range, physical block range,
* # length, and flags, followed by the total extent count.
*/
Journal (jbd2)
ext4's journal (jbd2, Journaling Block Device v2) is a circular write-ahead log. All metadata changes are recorded in the journal before being applied to the filesystem proper. After a crash, the journal is replayed: committed transactions are reapplied, and transactions without a commit record are discarded. This gives fast recovery (seconds, not the hours that a full fsck can take on large filesystems).
/* Journal structure:
* Located at s_journal_inum inode (internal journal) or on a separate device.
*
* Journal superblock (at start of journal device or journal inode):
* s_header.h_magic = JBD2_MAGIC_NUMBER (0xC03B3998)
* s_header.h_blocktype = JBD2_SUPERBLOCK_V2 (4)
* s_sequence = ID of the first transaction expected in the log
* s_start = block number where the log starts (0 = journal is empty)
* s_errno = last error code
* s_feature_incompat = bitmask of JBD2_FEATURE_INCOMPAT_* flags
* s_nr_users = number of filesystems sharing this journal
* s_blocksize = journal block size (typically 4096)
*
* Transaction format on disk:
* [ descriptor block(s) ] — journal header (magic, block type, txn_id) followed by block tags
* [ logged blocks ] — copies of modified metadata blocks (plus file data in data=journal mode)
* [ commit block ] — JBD2_COMMIT_BLOCK, txn_id → atomic commit
*
* jbd2 transaction lifecycle:
* handle = jbd2_journal_start(journal, nblocks)
* → join the running transaction (or start one), reserving nblocks of journal credits
* jbd2_journal_get_write_access(handle, buffer_head)
* → declare the intent to modify a metadata buffer
* jbd2_journal_dirty_metadata(handle, buffer_head)
* → file the buffer on the transaction's BJ_Metadata list
* jbd2_journal_stop(handle)
* → release the handle; the transaction keeps running so other handles can join it
* The kjournald2 thread commits periodically (commit= interval, 5 s by default), on fsync, or when journal space runs low:
* → jbd2_journal_commit_transaction() → write descriptors + logged blocks, then the commit block
*
* Descriptor blocks: one or more per transaction, listing every block being logged.
* Each entry is a jbd2_journal_block_tag_t (8 to 16 bytes, depending on journal features)
* tag.t_blocknr — block number being modified
* tag.t_flags — flags: escaped, same UUID, deleted, last tag
*
* Recovery:
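* jbd2_journal_recover()
* - Read the journal superblock; s_start and s_sequence locate the oldest transaction still in the log
* - Scan forward to find the last transaction with a valid commit block
* - Collect revoke records, then replay the logged blocks of committed transactions in order
* - Mark the journal empty (s_start = 0) once replay is complete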
*/
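The replay rule (apply committed transactions in sequence order, stop at the first one without a valid commit) can be illustrated with a toy model. This sketch is not jbd2 code: the `toy_*` types are invented for illustration, a "block" is a single byte, and revoke records and checksums are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TAGS 4

/* One logged block: its home location and its new contents. */
struct toy_tag {
    uint32_t blocknr;
    uint8_t value;
};

/* Toy transaction; real jbd2 stores descriptor, data and commit blocks. */
struct toy_txn {
    uint32_t sequence; /* transaction ID */
    int has_commit;    /* nonzero if a valid commit block was found */
    size_t ntags;
    struct toy_tag tags[TOY_MAX_TAGS];
};

/*
 * Replay committed transactions in order, starting at first_seq.
 * Stops at the first transaction that is out of sequence or has no
 * commit block. Returns the number of transactions applied.
 */
static size_t toy_replay(const struct toy_txn *log, size_t nlog,
                         uint32_t first_seq, uint8_t *disk, size_t ndisk)
{
    size_t applied = 0;
    uint32_t expect = first_seq;

    for (size_t i = 0; i < nlog; i++) {
        const struct toy_txn *t = &log[i];

        if (t->sequence != expect || !t->has_commit)
            break; /* uncommitted tail of the log: ignore it */
        assert(t->ntags <= TOY_MAX_TAGS);
        for (size_t j = 0; j < t->ntags; j++) {
            assert(t->tags[j].blocknr < ndisk);
            disk[t->tags[j].blocknr] = t->tags[j].value;
        }
        applied++;
        expect++;
    }
    return applied;
}
```

Replaying is idempotent in this model: running `toy_replay` twice over the same log leaves the same disk contents, which is why a crash during recovery is harmless.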
/* Mount options for journal mode:
*
* data=ordered (default) Metadata in journal; data blocks forced to disk
* BEFORE the commit record is written.
* Guarantees: after a crash, files never expose stale blocks. Most common.
*
* data=writeback Metadata in journal; data blocks may be written after the commit record.
* Risk: stale data visible in recently written files after a crash.
* Fastest journaling mode. Use for bulk sequential writes.
*
* data=journal Full journaling: both metadata and data written to journal first,
* then committed, then moved to final location.
* 2× write amplification. Use only for very high integrity needs;
* an external journal on a fast SSD reduces the cost.
*
* barrier=1 (default for ext4) Issues cache flushes (or FUA writes) around the
* journal commit record. Protects against a volatile disk write cache
* losing data on power loss. Disable (barrier=0) only when the write
* cache is non-volatile (battery- or capacitor-backed).
*
* # Check journal state:
* tune2fs -l /dev/nvme0n1p2 | grep -E "Journal|features"
* Journal inode: 8
* Journal backup: inode blocks
* Filesystem features: has_journal resize_inode dir_index flex_bg
* sparse_super large_file huge_file dir_nlink extra_isize
*
* # External journal on separate SSD (create the journal device first):
* mke2fs -O journal_dev /dev/nvme1n1p1
* mkfs.ext4 -J device=/dev/nvme1n1p1 /dev/nvme0n1p2
* # Resize an existing internal journal (filesystem unmounted): remove it, then re-create it
* tune2fs -O ^has_journal /dev/nvme0n1p2
* tune2fs -J size=512 /dev/nvme0n1p2
* # -J takes options such as size=, location=, and device=; stripe alignment is set with -E, not -J
*/
Reserved Blocks, Inline Data, and Fast Commit
/* Reserved blocks:
* Default: 5% of blocks reserved for root (tune2fs -m 5 /dev/...)
* Purpose: allow root-owned processes (system daemons, logging, recovery tools) to keep writing
* when the disk is "full" from the user's perspective.
* tune2fs -m 0 /dev/... → reserve 0% for root (not recommended on system volumes)
* tune2fs -r 0 /dev/... → same effect, expressed as a block count
*
* Inline data (EXT4_INLINE_DATA_FL, since Linux 3.8; requires the inline_data feature):
* If the file fits in the 60 bytes of i_block, the data is stored there;
* slightly larger files can spill into the in-inode xattr area ("system.data").
* This avoids allocating a data block for tiny files.
* Check: lsattr file → 'N' flag (inline data).
* Code path: ext4_has_inline_data(inode) steers reads to the inode body instead of a data block.
*/
/* Extended Attributes in ext4:
* In-inode xattr area: inode size - 128 - i_extra_isize bytes (about 96 bytes for 256-byte inodes)
* Beyond that: an external xattr block is allocated.
*
* struct ext4_xattr_entry:
* e_name_len — length of attribute name
* e_name_index — namespace index (user, trusted, security, system, ...)
* e_value_offs — offset to value within the xattr area or block
* e_value_inum — inode holding the value when it is stored in a separate inode (ea_inode feature)
* e_value_size — bytes of value
* e_hash — hash of name and value
* e_name — attribute name bytes (namespace prefix omitted; see e_name_index)
*
* Namespaces:
* user.* → userspace-accessible attributes (user_xattr mount option)
* trusted.* → privileged tools (CAP_SYS_ADMIN)
* system.posix_acl_access → POSIX ACL (size depends on the number of entries)
* system.posix_acl_default → default ACL for directories
* security.* → SELinux (security.selinux), capabilities
*/
/* Fast Commit (EXT4_FEATURE_COMPAT_FAST_COMMIT):
* ext4 records "fast commits" (FC) for eligible updates instead of full journaling.
* Fast commits are lighter: they record just the changed extents + inode deltas,
* not a full set of logged metadata blocks.
*
* Eligible: truncate, extent changes, inode updates, link/unlink, rename.
* Ineligible operations (for example xattr changes, online resize, data-journaled inodes)
* fall back to a full jbd2 commit.
*
* FC lifecycle:
* ext4_fc_track_inode() and the other ext4_fc_track_*() helpers → record changes in memory
* ext4_fc_commit() on fsync → write fast-commit records to the FC area of the journal
* ext4_fc_mark_ineligible() → the next commit is a full jbd2 commit
* Recovery: FC records are logical, so ext4 replays them itself (ext4_fc_replay) after the normal jbd2 replay
*
* # Check fast commit status:
* tune2fs -l /dev/nvme0n1p2 | grep -i fast_commit
* Filesystem features: ... fast_commit ...
*/
/* Online defragmentation (e4defrag):
* Online defrag works through the EXT4_IOC_MOVE_EXT ioctl (available since Linux 2.6.31)
* e4defrag /path/to/file → rewrites extents to be contiguous
* e4defrag /mount → defragment all files in the mount
*
* # Check fragmentation:
* e4defrag -c /path
* # Shows: current extents, best possible, fragmentation %
*
* Note: defrag on mounted filesystem moves data while writes may be in-flight.
* ext4 locks the inode during extent manipulation (EXT4_IOC_MOVE_EXT).
*/
XFS: Allocation Groups, B+Trees, and Deferred Logging
XFS is a high-performance journaling filesystem designed for parallel I/O and large scale. Its defining structure is the Allocation Group (AG): each AG is an independent allocation domain with its own free-space B+trees and inode B+trees. Threads allocating in different AGs do not contend on the same locks, which makes XFS a good fit for multi-CPU, multi-threaded workloads. The journal uses delayed logging (referred to here as deferred logging): many metadata changes are batched in memory and written to the log together.
Allocation Groups and the Per-AG Structure
/* XFS filesystem layout:
*
* Each device is carved into Allocation Groups (AGs).
* AG size = agblocks × blocksize. Maximum: 1 TiB per AG; mkfs.xfs picks the size from the device size.
* Number of AGs = device size / AG size (rounded up).
*
* AG structure (the first four sectors of each AG):
* Sector 0: superblock (primary in AG 0, backup copies in the other AGs)
* Sector 1: AGF (Allocation Group Free space header)
* Sector 2: AGI (Allocation Group Inode header)
* Sector 3: AGFL (Allocation Group Free List)
* After that: free space btrees, inode btrees, inode chunks, and data
*
* AGF (struct xfs_agf):
* agf_magicnum = XFS_AGF_MAGIC (0x58414746 = 'XAGF')
* agf_versionnum = 1
* agf_seqno = AG number (0, 1, 2, ...)
* agf_length = actual blocks in this AG
* agf_roots[] = root block numbers for the free space B+trees (bnobt, cntbt)
* agf_levels[] = heights of those B+trees
* agf_freeblks = total free blocks in this AG
* agf_longest = longest free extent in this AG
* agf_flfirst = index of the first active entry in the AGFL
* agf_fllast = index of the last active entry in the AGFL
* agf_flcount = number of active entries in the AGFL
* The AGFL itself is a circular array of block numbers kept in reserve for btree growth.
*
* AGI (struct xfs_agi):
* agi_magicnum = XFS_AGI_MAGIC (0x58414749 = 'XAGI')
* agi_versionnum = 1
* agi_count = number of allocated inodes in this AG
* agi_root = root block of the inode btree (inobt)
* agi_level = inode btree height
* agi_freecount = number of free inodes in allocated chunks
* agi_free_root, agi_free_level = root and height of the free inode btree (finobt)
* agi_unlinked[64] = hash table of inodes that are unlinked but still open
*
* Superblock (sector 0 of AG 0 is the primary; every other AG holds a backup copy):
* sb_magicnum = XFS_SB_MAGIC (0x58465342 = 'XFSB')
* sb_blocksize = block size (min 512, default 4096, max 65536)
* sb_agblocks = blocks per AG
* sb_agcount = number of AGs
* sb_crc = superblock checksum (v5 filesystems)
* sb_unit = stripe unit for data
* sb_width = stripe width for data
* sb_rootino = inode # of root directory
* sb_rbmino = inode # of realtime bitmap
* sb_rsumino = inode # of realtime summary
* sb_rblocks = number of realtime blocks (0 without a realtime device)
* sb_meta_uuid = UUID stamped into metadata, used when the user-visible UUID has been changed
*/
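XFS magic numbers are four ASCII characters stored big-endian, so a hex constant can be checked against its tag mechanically. The helpers below (`magic_to_ascii`, `magic_is`) are illustrative names, not xfsprogs functions.

```c
#include <stdint.h>
#include <string.h>

/* Write the four ASCII bytes of a big-endian magic number into out[5]. */
static void magic_to_ascii(uint32_t magic, char out[5])
{
    out[0] = (char)((magic >> 24) & 0xFF);
    out[1] = (char)((magic >> 16) & 0xFF);
    out[2] = (char)((magic >> 8) & 0xFF);
    out[3] = (char)(magic & 0xFF);
    out[4] = '\0';
}

/* Return nonzero if the magic number spells the given four-letter tag. */
static int magic_is(uint32_t magic, const char *tag)
{
    char buf[5];

    magic_to_ascii(magic, buf);
    return strcmp(buf, tag) == 0;
}
```

For example, `magic_is(0x58465342, "XFSB")` holds for the superblock, and the AGF and AGI tags 'XAGF' and 'XAGI' correspond to 0x58414746 and 0x58414749.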
/* AG size formula and mkfs defaults:
* mkfs.xfs -d agcount=8 → 8 AGs (size / 8 each)
* mkfs.xfs -d agsize=2g → AG size = 2 GiB (default: based on device size)
*
* # Optimal for large files (video editing, databases):
* mkfs.xfs -d agcount=4 -n size=64k -l size=128m /dev/nvme0n1
* # 4 AGs, 64K directory block size, 128M journal
*
* # Stripe-aligned for RAID:
* mkfs.xfs -d su=64k,sw=4 -l su=64k /dev/md0
* # su=64K stripe unit, sw=4 data disks (parity disks not counted), log aligned to the same stripe unit
*/
B+Trees for Free Space and Inode Allocation
/* XFS B+tree structures (on-disk):
* Two types of free space btrees per AG:
* 1. by block number (bno) — finds free extents at a given block
* 2. by block count (cnt) — finds free extents of a given size
*
* And inode btrees per AG:
* - the inobt records each allocated inode chunk (64 inodes) with a free-inode bitmask;
* the inode number itself encodes the AG and the location within it
*
* Generic B+tree node (struct xfs_btree_block):
* bb_magic = XFS_BMAP_CRC_MAGIC / XFS_ABTB_MAGIC / XFS_ABTC_MAGIC
* bb_level = 0 for leaf, 1+ for internal nodes
* bb_numrecs = number of records in this block
* bb_leftsib = previous sibling block
* bb_rightsib = next sibling block
* (then records follow)
*
* B+tree leaf for free space by length (cnt btree, struct xfs_alloc_rec):
* ar_startblock — first block of free range
* ar_blockcount — length of free range
* One record per contiguous free extent in the AG.
*
* B+tree internal node:
* keys (struct xfs_alloc_key): ar_startblock, ar_blockcount of the first record below
* pointers (xfs_alloc_ptr_t): AG-relative block number of each child
*
* Lookup by length (cnt btree):
* xfs_alloc_lookup_ge() — position at the first free extent >= the requested length
* Starts at the cntbt root from the AGF → walks to a leaf → picks the best fit
*
* Lookup by block number (bno btree):
* xfs_alloc_lookup_le() / xfs_alloc_lookup_ge() — find free extents around a target block
* Used to allocate near existing data for locality
*
* Inode allocation:
* xfs_dialloc() picks an AG (directories are spread across AGs; files stay near their parent),
* then takes a free inode from an existing chunk or allocates a new chunk
*
* # Inspect XFS geometry (illustrative output, abbreviated):
* xfs_info /mnt/data
* meta-data=/dev/md0 isize=512 agcount=8, agsize=268435455 blks
* data = bsize=4096 blocks=2147483640, imaxpct=5
* = sunit=16 swidth=64 blks
* naming =version 2 bsize=4096 ascii-ci=0, ftype=1
* log =internal log bsize=4096 blocks=32768, version=2
* = sectsz=512 sunit=16 blks, lazy-count=1
* realtime =none extsz=4096 blocks=0, rtextents=0
*/
Deferred Logging and the Journal
/* XFS journal (log) design:
* XFS uses delayed logging: metadata modifications are accumulated in memory
* on the Committed Item List (CIL) and written to the log in large checkpoints.
* Repeated changes to the same object are logged once per checkpoint, which greatly reduces log I/O.
*
* Log record structure:
* log record = header + regions (each region = the changed bytes of one log item)
* Header: cycle number and LSN, checksum, length
*
* Batching example:
* Unlink a file: directory entry removed + inode nlink decremented + free space updated
* jbd2 (ext4): logs a full copy of every modified metadata block
* XFS: logs only the changed regions of each item; all three changes belong to one transaction
*
* Transaction flow in XFS:
* xfs_trans_alloc() → allocate a transaction and reserve log space
* xfs_trans_ijoin() → attach a locked inode to the transaction
* xfs_trans_log_inode() → mark the inode fields that changed
* xfs_trans_commit() →
* - the dirty items are inserted into the CIL; no log I/O happens yet
* - a background worker later formats the CIL into a checkpoint and writes it to the log
* - fsync and sync mounts force the log (xfs_log_force()) and wait for the write
*
* Commit is therefore asynchronous by default: xfs_trans_commit() returns once the items are in the CIL.
* Callers that need durability wait for the log write that carries their checkpoint.
*
* Recovery: log replay from the tail (oldest change not yet written back) to the head
* - Scan the log forward: replay all transactions that have a commit record
* - Transactions with a missing commit record are ignored
*
* External log device:
* mkfs.xfs -l logdev=/dev/nvme1 /dev/nvme0
* mount -o logdev=/dev/nvme1 /dev/nvme0 /mnt/data
* # Log on separate fast SSD → much better fsync latency
* # Especially for databases with many small synchronous writes
*
* Internal log (default): on same device as data.
* Only metadata is journaled: changes go to the log first, then asynchronously to their final location.
* Write-ahead logging ensures committed transactions survive a crash.
*
* Log size considerations:
* - Larger log = more room for in-flight changes = better batching
* - Default: chosen by mkfs.xfs from the filesystem size (upper limit about 2 GiB)
* - tune: -l size=256m (256 MB log)
* - Log full: new transactions block until metadata writeback frees log space (tail pushing)
*/
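The benefit of batching repeated changes before they reach the log can be shown with a toy list. This is a sketch of the idea only: `toy_cil` and its functions are invented names, object IDs stand in for log items, and nothing here mirrors the real XFS data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CIL_MAX 16

/* Toy committed-item list: one slot per distinct dirty object. */
struct toy_cil {
    uint64_t items[TOY_CIL_MAX]; /* object IDs (e.g., inode numbers) */
    size_t count;                /* distinct dirty objects */
    size_t commits_seen;         /* commits absorbed since last checkpoint */
};

/*
 * Record that a committed transaction modified object `id`.
 * Relogging an object that is already present adds no new entry.
 */
static void toy_cil_insert(struct toy_cil *cil, uint64_t id)
{
    cil->commits_seen++;
    for (size_t i = 0; i < cil->count; i++)
        if (cil->items[i] == id)
            return;
    assert(cil->count < TOY_CIL_MAX);
    cil->items[cil->count++] = id;
}

/* Checkpoint: "write" every item once, empty the list, return items written. */
static size_t toy_cil_checkpoint(struct toy_cil *cil)
{
    size_t written = cil->count;

    cil->count = 0;
    cil->commits_seen = 0;
    return written;
}
```

If inode 100 is modified by three transactions and inode 200 by one, the checkpoint writes two items for four commits; the more often hot objects are relogged, the larger the saving.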
/* xfs_info shows:
* isize — inode size (256 bytes on v4 filesystems, 512 bytes by default on v5/CRC filesystems)
* agsize — AG size in blocks
* agcount — number of AGs
* sunit/swidth — stripe unit/width for aligned I/O (RAID settings)
* log — log device (internal or external)
*/
Realtime Subvolume and growfs
/* XFS Realtime Subvolume:
* A separate device (rtdev) where file data is allocated in fixed-size realtime extents.
* Purpose: predictable allocation latency and layout for streaming data (video, databases).
*
* Realtime allocation:
* mkfs.xfs -r rtdev=/dev/nvme2,extsize=64k /dev/nvme0
* # Space on the rtdev is handed out in 64K realtime extents; file allocations are rounded up to that size.
*
* Realtime metadata and data:
* rt bitmap — one bit per realtime extent (used/free), stored in the inode at sb_rbmino
* rt summary — per-size summary of free extents for fast allocation (sb_rsumino)
* rt data — the rtdev itself holds file data only
*
* xfs_info shows:
* realtime =/dev/nvme2 extsz=65536 blocks=..., rtextents=...
* # extsz = realtime extent size in bytes
* # blocks, rtextents = size of the realtime section in blocks and in extents
*
* Restrictions:
* - Only regular file data (directories and all metadata stay on the data device)
* - Allocation granularity is the realtime extent size (sb_rextsize)
* - You can't shrink the rt subvolume (only grow)
*
* # Grow the realtime section after enlarging the rtdev:
* xfs_growfs -r /mnt/video
*/
/* xfs_growfs:
* Grow filesystem online (data device or rtdev)
* xfs_growfs /mnt/data — grow the data section to fill the device (adds AGs)
* xfs_growfs -D 268435456 /mnt/data — grow the data section to that many filesystem blocks (1 TiB at 4K)
*
* Limitations:
* - Cannot shrink XFS (back up, re-create, and restore instead)
* - Growth adds new AGs; the AG size chosen at mkfs time does not change
* - rtdev grow: xfs_growfs -r /mnt/data
*
* Online scrub and repair:
* xfs_scrub /mnt/data — scrub all metadata (re-read and verify checksums)
* xfs_repair -v /dev/nvme0n1 — offline repair, can fix most corruption
* xfs_db -r -c "frag -v" /dev/nvme0n1 — check extent fragmentation
*/
Btrfs: Copy-on-Write, B-Trees, RAID, and Snapshots
Btrfs is a copy-on-write filesystem with checksums, snapshots, and built-in RAID. All metadata is stored in checksummed B-tree nodes, and file data is covered by checksums kept in a dedicated tree. Writing a file doesn't overwrite blocks: it allocates new blocks, writes the data, updates the parent tree pointers to the new blocks, and commits the transaction. Old blocks remain reachable from snapshots until their reference count hits zero. This makes snapshot creation nearly constant-time and enables bit-rot detection on every read.
On-Disk B-Tree Structure
/* Every metadata block on a Btrfs filesystem is a tree node. No separate inode table.
* B+tree variant with these properties:
* - All items in leaf nodes
* - Internal nodes point to child nodes by key
* - Keys are (objectid, type, offset) tuples, sorted in that order
*
* struct btrfs_header (101 bytes at start of every tree block):
* csum — checksum of the rest of the block (32 bytes reserved; crc32c by default)
* fsid — filesystem UUID (same as superblock)
* bytenr — logical address of this block (self-verify)
* flags — BTRFS_HEADER_FLAG_WRITTEN, BTRFS_HEADER_FLAG_RELOC
* chunk_tree_uuid — UUID of the chunk tree this block belongs to
* generation — transaction ID when this node was last modified
* owner — ID of the tree that owns the block (1=root tree, 2=extent tree, 3=chunk tree, 5=default FS tree, 256+=subvolumes)
* nritems — number of items in this node
* level — 0 for leaves, >0 for internal nodes
* (items or key pointers follow after the header)
*
* struct btrfs_key (17 bytes on disk):
* objectid — inode number, tree ID, or logical address, depending on the tree
* type — item type: 1=INODE_ITEM, 12=INODE_REF, 84=DIR_ITEM, 108=EXTENT_DATA,
* 128=EXTENT_CSUM, 132=ROOT_ITEM, 228=CHUNK_ITEM, etc.
* offset — key-specific: byte offset for file extents, name hash for dir items, etc.
*
* Leaf item format (btrfs_item):
* key — the key for this item
* offset — where the item data starts, counted from the end of the header
* size — size of item data
* data — item data; item headers grow from the front of the leaf, data from the back
*
* Root tree (tree ID 1): contains the roots of all other trees (extent tree, FS trees, ...)
* Chunk tree (tree ID 3): maps logical addresses to physical stripes on devices
* FS tree (tree ID 5, subvolumes 256+): per-subvolume tree of files and directories
*
* Key lookup: btrfs_search_slot() → binary search within leaf → descend or return
*
* Transaction commits:
* Each commit increments the generation number. The superblock records the
* current root of the root tree together with its generation, and is rewritten last.
* Because committed blocks are never overwritten in place, a snapshot's tree stays intact while the live tree moves on.
*/
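Key ordering and the binary search inside a node are the core of every lookup. The sketch below models them with an in-memory key in CPU byte order; `toy_key`, `toy_key_cmp`, and `toy_bin_search` are illustrative names, and the real btrfs_search_slot() additionally handles locking, CoW, and descent between levels.

```c
#include <stddef.h>
#include <stdint.h>

/* In-memory form of a Btrfs key: compared as (objectid, type, offset). */
struct toy_key {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
};

/* Returns <0, 0 or >0 like strcmp(), comparing field by field. */
static int toy_key_cmp(const struct toy_key *a, const struct toy_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * Binary search over the sorted keys of one node.
 * Returns 1 and sets *slot to the match, or returns 0 and sets *slot
 * to the position where the key would have to be inserted.
 */
static int toy_bin_search(const struct toy_key *keys, size_t n,
                          const struct toy_key *want, size_t *slot)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = toy_key_cmp(&keys[mid], want);

        if (c == 0) {
            *slot = mid;
            return 1;
        }
        if (c < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    *slot = lo;
    return 0;
}
```

Because the objectid sorts first, all items of one inode (its INODE_ITEM, INODE_REF, and EXTENT_DATA entries) sit next to each other in a leaf, and a miss still returns the slot where a range scan should start.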
/* File extent, checksum, and root items:
* EXTENT_DATA items (in the FS tree): key=(ino, EXTENT_DATA, file_offset)
* btrfs_file_extent_item: disk_bytenr, disk_num_bytes, offset, num_bytes, compression, encryption
*
* EXTENT_CSUM items (in the checksum tree): key=(BTRFS_EXTENT_CSUM_OBJECTID, EXTENT_CSUM, logical address)
* Each data sector has a checksum stored in one of these items. Scrub verifies against it.
*
* ROOT_ITEM items (in the root tree): key=(subvol_id, ROOT_ITEM, 0)
* Points to the root node of a subvolume tree.
*
* Snapshot creation: copy the source subvolume's root node and add a ROOT_ITEM for the copy.
* Everything below the root is shared (reference counts increase). No file data is copied.
*/
Extent Tree, Checksum Tree, and Block Groups
/* Btrfs block groups (chunk allocation):
* - Raw device space is allocated in chunks (typically 1 GiB for data, 256 MiB to 1 GiB for metadata)
* - Each chunk is one RAID profile (single, dup, raid1, raid10, raid5, raid6)
* - New chunks are allocated on demand when existing block groups of that type run out of space
*
* Chunk item (in chunk tree):
* key: (BTRFS_FIRST_CHUNK_TREE_OBJECTID, CHUNK_ITEM, logical offset)
* btrfs_chunk: length, owner, stripe_len, type, num_stripes,
* stripes[num_stripes] = device ID + physical offset + device UUID
*
* Allocation flow:
* btrfs_chunk_alloc() → picks devices for the profile → creates a chunk and its block group
* btrfs_reserve_extent() → finds free space for an extent inside a block group
* Chunks are striped across devices per RAID profile
*
* # Show chunk and block group info:
* $ btrfs filesystem usage /mnt/data
* [shows allocated and used space per profile and per device]
* $ btrfs inspect-internal dump-tree -t chunk /dev/nvme0n1
* [dumps the chunk tree: one CHUNK_ITEM per chunk with its stripes]
*/
/* Checksum tree (checksum items):
* key: (BTRFS_EXTENT_CSUM_OBJECTID, BTRFS_EXTENT_CSUM_KEY, logical address)
* csum_item: one checksum per data sector (4K by default)
*
* Scrub reads every data block, recomputes the checksum, compares to the stored one.
* If mismatch: report corruption, attempt to repair from another copy (DUP, RAID1/10) or from parity.
*
* Btrfs defaults to crc32c (hardware-accelerated on modern CPUs with SSE4.2);
* xxhash, sha256, and blake2 can be chosen at mkfs time.
* For compressed extents the checksum covers the compressed bytes as stored on disk.
*/
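To make the verify step concrete, here is a bitwise software CRC-32C (Castagnoli polynomial, reflected form 0x82F63B78) and a comparison helper. It is a minimal sketch: production code uses table-driven or hardware-accelerated variants, `sector_matches` is an invented name, and how Btrfs lays out the stored checksum on disk is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard CRC-32C: initial value 0xFFFFFFFF, reflected, final inversion. */
static uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            /* Shift right; XOR in the polynomial if the dropped bit was 1. */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Scrub-style check: recompute and compare with the stored checksum. */
static int sector_matches(const void *data, size_t len, uint32_t stored)
{
    return crc32c(data, len) == stored;
}
```

The usual check value applies: the CRC-32C of the ASCII string "123456789" is 0xE3069283. Flipping any single bit in a sector changes the result, which is what lets scrub detect bit rot.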
/* Chunk and extent relationship:
* Chunk = physical allocation unit from raw devices
* Extent = logical range inside Btrfs address space
*
* Example: 256K file → one extent covering 256K
* Physical: if single profile, 256K from chunk at offset X on device nvme0n1
* If raid1: same 256K on device nvme1n1 as mirror
*
* Reference counting: btrfs_extent_data_ref tracks which subvol+inode+offset
* points to each physical extent. Multiple snapshots of the same file share
* the same physical blocks until one is written (CoW breaks the sharing).
*/
/* Block group types:
* SYSTEM — allocation metadata (chunk tree nodes)
* METADATA — all Btrfs tree nodes + filesystem metadata
* DATA — file data blocks
*
* Example profiles:
* mkfs.btrfs -d single -m single /dev/nvme0 # one copy of data and one of metadata
* mkfs.btrfs -f -d raid0 -m raid1 /dev/nvme0 /dev/nvme1 # -f overwrites an existing filesystem; these profiles need >= 2 devices
*
* Mixed block groups (mkfs.btrfs --mixed): one chunk contains both metadata and data.
* Intended for very small filesystems; separate data and metadata chunks are the norm elsewhere.
*/
RAID Profiles, Balance, and Scrub
/* Data profiles:
* single — 1 copy, no redundancy (space-efficient, no fail tolerance)
* dup — 2 copies on same device (not real RAID; for metadata only on single-device)
* raid0 — striping, no parity (perf, no redundancy)
* raid1 — 2 copies on different devices (any 1 device can fail)
* raid10 — striping + mirroring (perf + redundancy, 2 copies of every block)
* raid5 — 1 parity block, any 1 device can fail (write-hole risk)
* raid6 — 2 parity blocks, any 2 devices can fail (safer for large arrays)
*
* Metadata profiles: same options. Default: dup (single-device), raid1 (multi-device).
*
* RAID5/6 write hole:
* On power failure mid-write: parity (P or Q) may be inconsistent with the data in a stripe.
* Btrfs does not journal partial-stripe updates, so the inconsistency can persist
* until a scrub rewrites parity; a later device failure may then return bad data.
* Power-loss protection (UPS, capacitor-backed drives) reduces the risk but does not remove it.
* Recommendation: avoid raid5/6 for critical data, and never for metadata, until the write hole is fully fixed.
*
* btrfs balance start -dconvert=raid1,soft /mnt/data
* # -d = data, -m = metadata, -s = system chunks
* # soft: skip chunks that already have the target profile (avoids unnecessary I/O)
*/
/* Balance:
* Rewrites chunks to rebalance usage across devices.
* Also used to convert profiles (raid0 → raid1) without data loss.
*
* btrfs balance start /mnt/data
* # Full balance: rewrites every chunk (slow; it does not defragment files)
*
* btrfs balance start -dusage=50 /mnt/data
* # Only rewrite data chunks that are at most 50% full (compacts sparse chunks)
*
* btrfs balance pause /mnt/data
* btrfs balance resume /mnt/data
* btrfs balance cancel /mnt/data
* # A running balance can be paused and resumed, or cancelled
*
* btrfs balance status /mnt/data
* # Show progress
*
* Device replace: copy a device's contents onto a new one, then balance if the layout is uneven
* btrfs replace start /dev/sdb /dev/sdc /mnt/data
*/
/* Scrub:
* btrfs scrub start /mnt/data # runs in the background; continue an interrupted run with "btrfs scrub resume"
* btrfs scrub status /mnt/data
* btrfs scrub cancel /mnt/data
*
* Scrub reads all data, verifies checksums, repairs from mirrors.
* Use: btrfs scrub start -B /mnt/data (blocking, for cron jobs)
*/
/* Device stats:
* btrfs device stats /mnt/data
* [/dev/nvme0n1].write_errs 0
* [/dev/nvme0n1].read_errs 0
* [/dev/nvme0n1].flush_errs 0
* [/dev/nvme0n1].corruption_errs 0
* [/dev/nvme0n1].generation_errs 0
*
* Any non-zero value indicates a problem. Flush errors often mean the
* device is slow to respond or its write cache is having issues.
*/
Subvolumes, Snapshots, Quotas, and Send/Receive
/* Subvolume = a tree with its own root and inode number space.
* Snapshot = a subvolume created from another subvolume's root.
* Both share the same on-disk blocks (CoW reference counting).
*
* Subvolume linkage (ROOT_REF / ROOT_BACKREF items in the root tree):
* struct btrfs_root_ref {
* dirid — inode # of the directory that contains the subvolume or snapshot
* sequence — index of the entry in that directory
* name_len — length of the name
* (the name bytes follow the struct)
* };
*
* Each data extent has reference counts (btrfs_extent_data_ref.count).
* The extent tree records which (root, inode, offset) tuples reference each extent.
* When the refcount reaches 0: the extent is freed. This is batched (delayed refs run at transaction commit).
*
* When you snapshot /mnt/data, then write to a file in /mnt/data:
* - CoW allocates new data blocks for the modified file
* - Old blocks: refcount decremented
* - Snapshot still points to old blocks → snapshot's data is unchanged
*
* Snapshot create is near-instant: only the root node is copied and a new ROOT_ITEM is added.
* Snapshot delete: the ROOT_ITEM is marked for deletion and a background cleaner walks the tree, freeing blocks whose refcount drops to 0.
*/
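The snapshot-then-write sequence can be traced with a toy reference-counted block pool. This is a deliberately tiny model with invented `toy_*` names: a subvolume is a single file mapped to a single block, and there are no trees or transactions.

```c
#include <assert.h>

#define TOY_NBLOCKS 8

/* Toy extent allocator: refs[i] counts how many subvolumes point at block i. */
struct toy_pool {
    int refs[TOY_NBLOCKS];
};

/* A "subvolume" here is just one file mapped to one block. */
struct toy_subvol {
    int block;
};

/* Allocate the first free block; returns -1 when the pool is full. */
static int toy_alloc(struct toy_pool *p)
{
    for (int i = 0; i < TOY_NBLOCKS; i++) {
        if (p->refs[i] == 0) {
            p->refs[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Snapshot: share the block and bump its reference count; nothing is copied. */
static struct toy_subvol toy_snapshot(struct toy_pool *p,
                                      const struct toy_subvol *src)
{
    assert(src->block >= 0 && src->block < TOY_NBLOCKS);
    p->refs[src->block]++;
    return *src;
}

/*
 * CoW write: allocate a new block, repoint the subvolume at it, and drop
 * the reference on the old block. Returns 0 on success, -1 if out of space.
 */
static int toy_cow_write(struct toy_pool *p, struct toy_subvol *sv)
{
    int nb = toy_alloc(p);

    if (nb < 0)
        return -1;
    p->refs[sv->block]--; /* old block stays allocated while refs > 0 */
    sv->block = nb;
    return 0;
}
```

After a snapshot the shared block has two references. A write to the live subvolume moves it to a fresh block and leaves the old block with one reference, held by the snapshot, so the snapshot's contents do not change.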
/* Quota groups (qgroups):
* Track space used by subvolumes for quota enforcement.
* btrfs quota enable /mnt/data
* btrfs qgroup limit 50G /mnt/data/subvol
* # Each subvolume gets a level-0 qgroup (0/<subvol id>) automatically once quotas are enabled
*
* Qgroups track per-subvol referenced (= all data reachable) and exclusive (= not shared with others) bytes.
* Reflink copies (cp --reflink) share blocks between subvols; qgroups account for the sharing.
*/
/* Send/Receive: incremental streaming backup:
* btrfs send -p /mnt/data/snap0 /mnt/data/snap1 | ssh backup "btrfs receive /backup"
* # -p: parent snapshot (send incremental diff from snap0 to snap1)
*
* Send generates a stream of operations: create subvol, write X bytes at offset, etc.
* Receive applies them to the target. The stream is not compressed by default;
* pipe it through a compressor (e.g., zstd) when bandwidth matters.
*
* Use case: efficient incremental backups to a remote system.
* Without -p: full subvol send (more data, but no dependency on parent)
*
* Note: send requires the source snapshots to be read-only.
* Use: btrfs subvolume snapshot -r /mnt/data /mnt/data/snap1
* or: btrfs property set /mnt/data/snap1 ro true
*/
/* Compress and share:
* btrfs filesystem defragment -r -czstd /var/log
* # Defragment and recompress with zstd (note: defragmenting breaks sharing with snapshots and reflinks)
*
* Compression algorithms: zlib (slower, better ratio), lzo (fastest, lowest ratio), zstd (good balance)
* btrfs property set /mnt/data compression zstd
*
* Copy-on-write with compression:
* - Data is compressed in pieces of up to 128 KiB and stored as compressed extents
* - Checksums cover the compressed bytes on disk
* - Small files often fit inline in tree leaves (no separate data extent)
*
* reflink: cp --reflink src dst
* - Shares physical blocks (refcount incremented)
* - Near-instant for large files: only metadata is written
* - Used for deduplication tools, VM images, database snapshots
*/
ZFS: Records, ARC, and Pools
ZFS (OpenZFS on Linux) is a combined filesystem and volume manager with copy-on-write semantics, end-to-end checksums, RAID-Z, snapshots, and cloning. Storage is organized in pools (zpool) built from vdevs (virtual devices). The ARC (Adaptive Replacement Cache) is ZFS's own cache, kept separate from the Linux page cache and using a more adaptive policy than plain LRU. ZFS stores file data in variable-sized records (up to the dataset's recordsize) instead of fixed blocks, which works well for streaming workloads but needs tuning for small random I/O.
/* ZFS pool (zpool) structure:
* Pool = collection of vdevs (virtual devices)
* vdev types:
* mirror — 2+ disks, actual RAID1 (full copies)
* raidz1 — N+1 disks, 1 parity (RAID5 equivalent)
* raidz2 — N+2 disks, 2 parity
* raidz3 — N+3 disks, 3 parity
* spare — hot spare that replaces a failed device
* log — separate intent log (SLOG) device for the ZIL
* cache — L2ARC cache device (read cache, not write buffer)
* special, dedup — allocation classes for metadata and for deduplication tables
*
* Pool creation:
* zpool create tank mirror nvme0n1 nvme1n1
* # 2-disk mirror vdev = 1 mirror vdev with 2 children
*
* zpool add tank mirror nvme2n1 nvme3n1
* # Add second mirror vdev to tank (capacity doubles, no redundancy change)
*
* zpool create tank raidz2 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1
* # 6-disk raidz2 = 4 data + 2 parity (can lose any 2 disks)
*/
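A rough capacity estimate for the layouts above follows from counting data disks. The helpers below are a back-of-the-envelope sketch with made-up names; they assume equally sized disks and ignore RAID-Z padding, metadata, and reserved space, so real pools report somewhat less.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate usable bytes of one raidz vdev: data disks × disk size. */
static uint64_t raidz_usable(uint32_t ndisks, uint32_t parity,
                             uint64_t disk_bytes)
{
    assert(parity >= 1 && parity <= 3); /* raidz1, raidz2, raidz3 */
    assert(ndisks > parity);
    return (uint64_t)(ndisks - parity) * disk_bytes;
}

/* A mirror vdev stores one disk's worth of data regardless of its width. */
static uint64_t mirror_usable(uint32_t ndisks, uint64_t disk_bytes)
{
    assert(ndisks >= 2);
    return disk_bytes;
}

/* A pool's capacity is the sum over its top-level vdevs. */
static uint64_t pool_usable(const uint64_t *vdev_bytes, uint32_t nvdevs)
{
    uint64_t total = 0;

    for (uint32_t i = 0; i < nvdevs; i++)
        total += vdev_bytes[i];
    return total;
}
```

With 1 TB disks, the 6-disk raidz2 example comes to about 4 TB, and a pool of two 2-disk mirrors to about 2 TB.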
/* ARC (Adaptive Replacement Cache):
* Lives in kernel memory allocated by ZFS itself, not in the regular page cache.
* Tracks 4 lists:
* - MRU (Most Recently Used): blocks seen once recently
* - MFU (Most Frequently Used): blocks seen repeatedly
* - Ghost MRU and ghost MFU: metadata of recently evicted blocks, used to
* adapt the balance between MRU and MFU
*
* ARC size: dynamically sized, capped by zfs_arc_max (on Linux the default has traditionally been 50% of RAM)
* Shrinks under memory pressure, down to zfs_arc_min.
*
* ARC stats:
* cat /proc/spl/kstat/zfs/arcstats
* size 4 12884901888 # current ARC size
* c_max 4 17179869184 # maximum cap (16 GB)
* c_min 4 4294967296 # minimum (4 GB)
* hits 4 18446744071589371234 # cache hits
* misses 4 288230376451 # cache misses
*
* L2ARC: secondary cache on fast SSD
* zpool add tank cache /dev/nvme2n1
* # Extends ARC to SSD; L2ARC is a read cache, filled from blocks about to be evicted from ARC
* # A feed thread writes to the L2ARC at intervals (l2arc_feed_secs, l2arc_write_max)
*/
/* Records:
* ZFS stores data in records. Default record size: 128K.
* Small file: stored in a single block sized to the file (tiny compressed blocks can be embedded in the block pointer)
* Large file: records of 128K, one block pointer in the file's block tree per record
*
* recordsize property:
* zfs set recordsize=8K tank/postgres # match DB page size
* zfs set recordsize=1M tank/video # streaming I/O
*
* Compression: zfs set compression=lz4 tank (lz4 default, zstd available)
* Compressed records stored as variable-size; not aligned to record_size
* Accessing a compressed record: decompress on read, recompress on write
*
* Deduplication (zfs set dedup=on tank):
* Block-level dedup: blocks with same checksum share one physical copy
* Dedup table (DDT): maps checksum → physical block reference
* Cost: significant RAM (DDT must stay in ARC), slow writes
* Use only when dataset has truly duplicate data at block level
*/
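Why matching recordsize to the database page size matters can be shown with simple arithmetic. The sketch below assumes uncompressed, fully allocated records and that every touched record is rewritten whole; the function names are hypothetical.

```python
def records_touched(offset: int, length: int, recordsize: int) -> int:
    """Number of records overlapped by a write of `length` bytes at `offset`."""
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return last - first + 1

def write_amplification(offset: int, length: int, recordsize: int) -> float:
    """Bytes rewritten on disk per byte the application wrote."""
    return records_touched(offset, length, recordsize) * recordsize / length

K = 1024
# An 8K database page written into the default 128K record:
print(write_amplification(0, 8 * K, 128 * K))
# The same page with recordsize=8K:
print(write_amplification(0, 8 * K, 8 * K))
```

Under these assumptions an 8K page update rewrites 16 times as much data with 128K records as with 8K records.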
/* Snapshots in ZFS:
* zfs snapshot tank/postgres@before-migrate
* zfs rollback tank/postgres@before-migrate
*
* A snapshot is a read-only view of a dataset at a point in time.
* Implementation: the snapshot keeps the dataset's root block pointer (its objset)
* as of the txg (transaction group) in which it was created.
* Every block pointer records its birth txg, so ZFS can tell which blocks a
* snapshot still references and which can be freed.
*
* Clone: writable dataset created from a snapshot (zfs clone tank/postgres@backup tank/restore)
* Clone shares all blocks with its origin snapshot; writes cause CoW.
*
* Send/Receive streams (like Btrfs):
* zfs send -R tank/postgres@snap | zfs receive -F tank/restored
* # -R: replication stream (descendant datasets, their snapshots, and properties)
*
* Bookmark: zfs bookmark tank/postgres@snap1 tank/postgres#bkmrk
* A bookmark can be the source of an incremental send (-i) after its snapshot has been destroyed
*/
tmpfs and FUSE
tmpfs
tmpfs stores files entirely in the page cache and swap. It grows and shrinks dynamically, is swap-backed, and doesn't need a backing block device. This is what makes it fundamentally different from a ramdisk: a ramdisk is a block device with fixed capacity, pinned in RAM; tmpfs competes for memory with everything else and pages to swap under pressure.
/* tmpfs implementation:
* - Uses generic VFS infrastructure: dentries, inodes, page cache
* - Each file's struct address_space is backed by anonymous pages (no device)
* - Data pages are flagged swap-backed (PG_swapbacked) → can be swapped out
* - Inodes and dentries are ordinary kernel objects in slab memory and are never swapped
*
* shmem.c (shared memory filesystem) implements tmpfs + /dev/shm + POSIX shared memory
*
* Mount options:
* size=2G: maximum size (default: 50% of RAM)
* nr_blocks=1M: same limit as size, counted in PAGE_SIZE blocks (1M × 4 KiB = 4 GiB)
* nr_inodes=1M: maximum inodes (each file, directory, or symlink uses one)
* mode=1777: permissions of the root directory
* uid=0, gid=0: owner of the root directory
* mpol=interleave: NUMA allocation policy (default, prefer:Node, bind:NodeList, interleave, local)
*
* # Create a 1GB tmpfs for scratch work:
* mount -t tmpfs -o size=1G,mode=1777 tmpfs /mnt/scratch
*
* # tmpfs in fstab:
* tmpfs /mnt/scratch tmpfs size=1G,mode=1777 0 0
*
* # Check usage:
* df -h /mnt/scratch
* Filesystem Size Used Avail Use% Mounted on
* tmpfs 1.0G 0 1.0G 0% /mnt/scratch
*
* # Check inodes:
* df -i /mnt/scratch
*
* # /dev/shm and /run are tmpfs by default; /sys/fs/cgroup is tmpfs only under cgroup v1 (cgroup v2 mounts cgroup2 there).
* # Check: mount | grep tmpfs
*/
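The `mount | grep tmpfs` check above can also be done programmatically. A minimal sketch that parses text in the `/proc/mounts` format; `tmpfs_mounts` is a hypothetical helper and the sample lines are illustrative.

```python
def tmpfs_mounts(mounts_text: str) -> dict[str, dict[str, str]]:
    """Map each tmpfs mount point to its mount options."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "tmpfs":
            options = {}
            for opt in fields[3].split(","):
                key, _, value = opt.partition("=")
                options[key] = value  # flags such as "rw" map to ""
            result[fields[1]] = options
    return result

# Illustrative sample in /proc/mounts format.
SAMPLE = """\
tmpfs /run tmpfs rw,nosuid,nodev,size=3276800k,mode=755 0 0
/dev/nvme0n1p2 / ext4 rw,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
"""

for mount_point, opts in tmpfs_mounts(SAMPLE).items():
    print(mount_point, opts.get("size", "(default size)"))
```

On a live system, pass `open("/proc/mounts").read()` instead of the sample.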
/* tmpfs vs ramdisk:
* Ramdisk:
* mkfs.ext4 /dev/ram0
* mount /dev/ram0 /mnt/rd
* Fixed size, RAM pinned, never swaps
* Used for fixed-size, always-needed storage
*
* tmpfs:
* mount -t tmpfs -o size=512M tmpfs /mnt/ts
* Grows/shrinks dynamically
* Can swap out under memory pressure
* Default for /dev/shm and /run; /tmp too on distributions that enable systemd's tmp.mount
*/
/* systemd /tmp handling:
* systemd ships tmp.mount, which mounts /tmp as tmpfs on distributions that enable it.
* PrivateTmp=yes is separate: it gives a service its own private /tmp and /var/tmp.
* /var/tmp is expected to persist across reboots.
* tmpfiles.d configuration for temporary data:
* d /var/tmp 1777 root root -
* D /tmp 1777 root root -
* # 'd' creates the directory; 'D' also empties it when systemd-tmpfiles runs with --remove (as at boot)
*/
FUSE
FUSE lets you implement a filesystem in userspace. The kernel module (fuse.ko) exposes /dev/fuse and multiplexes VFS calls over a file descriptor opened on it. The userspace daemon reads request messages, processes them, and writes responses. The performance cost is high: every operation needs at least 2 context switches (kernel→userspace→kernel). But the tradeoff is compelling: you can prototype or implement complex filesystems without kernel module development.
/* FUSE operation flow (fuse.ko):
*
* 1. Mount (normally done by the fusermount3 helper on the daemon's behalf):
* - open("/dev/fuse") → fd
* - mount(source, target, "fuse", flags, "fd=N,rootmode=40000,user_id=U,group_id=G")
* - the fd becomes the communication channel for this mount
*
* 2. Userspace daemon (libfuse / pyfuse3 / go-fuse) serves requests on that fd
*
* 3. Request loop:
* - read(fd, buf, bufsize) → one whole request: struct fuse_in_header + opcode-specific body
* - switch (in.opcode): (handlers shown in libfuse's path-based high-level style)
* FUSE_LOOKUP → call lookup(name) → write resp
* FUSE_GETATTR → call getattr() → write resp
* FUSE_OPEN → call open(path, flags) → write resp
* FUSE_READ → call read(fh, offset, size) → write resp
* FUSE_WRITE → call write(fh, buf, offset) → write resp
* FUSE_READDIR → call readdir(fh, buf) → write resp
* FUSE_MKDIR → call mkdir(path, mode) → write resp
* ... (many more opcodes)
*
* 4. Kernel: VFS dispatches to concrete filesystem
* - ext4_lookup() → does lookup in ext4 directory data
* - For FUSE: fuse_lookup() → sends request to userspace daemon
* - Waits for response → returns dentry/inode
*
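* 5. Cache behavior:
* - Capabilities such as FUSE_POSIX_LOCKS and FUSE_ATOMIC_O_TRUNC are negotiated in FUSE_INIT
* - The kernel caches dentries and attributes for the timeouts the daemon returns
*   (libfuse: entry_timeout and attr_timeout, 1 second by default)
* - To disable caching: -o entry_timeout=0,attr_timeout=0,direct_io
* - -o default_permissions is unrelated to caching: it makes the kernel enforce mode-bit permission checks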
*
* # Mount an SSH remote as a filesystem:
* sshfs user@remotehost:/data /mnt/remote -o reconnect,ServerAliveInterval=60
* # Every ls, cat, write → round-trip to remote host
*
* # Encryption filesystem (encfs):
* encfs /data/encrypted /mnt/decrypted
* # /mnt/decrypted reads/writes → auto-encrypt/decrypt files in /data/encrypted
*
* # S3-backed filesystem (goofys):
* goofys bucket /mnt/s3
* # S3 as a filesystem (eventual consistency, no random writes, great for streaming)
*
* Performance comparison (fio, all on localhost; rough orders of magnitude that vary with hardware):
* Native ext4: ~500K IOPS, ~3 μs latency
* FUSE (empty fs): ~200K IOPS, ~5 μs latency (for pure memory operations)
* sshfs (loopback): ~50K IOPS, ~20 μs latency
*
* The gap comes from the userspace/kernel boundary and the serialization
* of FUSE operations through a single file descriptor.
*/
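The first thing the request loop above does is decode the fixed header at the front of each message. A minimal sketch of that step, assuming the 40-byte `struct fuse_in_header` layout from `<linux/fuse.h>`; only a handful of opcodes are listed, and `decode_header` is a hypothetical helper rather than part of any FUSE library.

```python
import struct
from typing import NamedTuple

# len, opcode (u32); unique, nodeid (u64); uid, gid, pid (u32); 4 trailing bytes
# that are padding or extension length depending on protocol version.
HEADER = struct.Struct("=IIQQIIII")

OPCODES = {
    1: "FUSE_LOOKUP", 3: "FUSE_GETATTR", 9: "FUSE_MKDIR", 14: "FUSE_OPEN",
    15: "FUSE_READ", 16: "FUSE_WRITE", 26: "FUSE_INIT", 28: "FUSE_READDIR",
}

class InHeader(NamedTuple):
    length: int   # total request size, header included
    opcode: int
    unique: int   # request ID echoed back in the reply
    nodeid: int   # inode the operation targets
    uid: int
    gid: int
    pid: int

def decode_header(request: bytes) -> InHeader:
    """Decode the header at the start of one request read from /dev/fuse."""
    fields = HEADER.unpack_from(request)
    return InHeader(*fields[:7])

# Build a fake FUSE_GETATTR request for nodeid 1 (the root) to exercise the decoder.
fake = HEADER.pack(HEADER.size, 3, 42, 1, 1000, 1000, 4321, 0)
hdr = decode_header(fake)
print(OPCODES.get(hdr.opcode, "unknown"), hdr.nodeid, hdr.unique)
```

Note that the kernel protocol addresses files by `nodeid`; path arguments only appear in higher-level library APIs that track the nodeid-to-path mapping for you.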
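/* Writing your own FUSE filesystem (Python example, using the fusepy package):
*
* import errno, stat
* from fuse import FUSE, FuseOSError, Operations
*
* DATA = b"Hello from FUSE!\n"
*
* class MyFS(Operations):  # fusepy dispatches through Operations.__call__
*     def getattr(self, path, fh=None):
*         if path == '/':  # the root must be a directory
*             return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
*         if path == '/hello.txt':
*             # st_size must be the real length, or reads return nothing
*             return dict(st_mode=stat.S_IFREG | 0o644, st_nlink=1, st_size=len(DATA))
*         raise FuseOSError(errno.ENOENT)
*     def readdir(self, path, fh):
*         return ['.', '..', 'hello.txt']
*     def open(self, path, flags):
*         return 0
*     def read(self, path, size, offset, fh):
*         return DATA[offset:offset + size]
*
* FUSE(MyFS(), '/mnt/fuse', foreground=True)
*
* That's a working read-only filesystem. Every read() hits your Python code.
* For production: use async I/O, batch operations, caching.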
*/
Direct I/O vs Buffered I/O
The choice between buffered and direct I/O fundamentally changes how the kernel handles your requests. Buffered I/O goes through the page cache (fast on hit, slow on miss). Direct I/O bypasses the page cache and submits bios directly to the block layer, making it useful for applications that manage their own cache (databases) or need predictable I/O patterns.
/* Buffered I/O (default, every read/write that doesn't specify O_DIRECT):
*
* (Function names follow recent kernels and shift between releases.)
*
* read(fd, buf, 4096):
* vfs_read() → file->f_op->read_iter()
* generic_file_read_iter() → filemap_read()
* 1. filemap_get_pages() → look up the page cache (an XArray, formerly a radix tree) by file offset
* 2. If the page is present and up to date → skip to step 4
* 3. If not → page_cache_sync_readahead() allocates pages, inserts them into the cache,
*    and submits bios; the reader sleeps until I/O completion marks them up to date
* 4. copy_folio_to_iter() → copy from the cached page to the userspace buf
* 5. Drop the page reference; the page stays cached on the LRU until reclaimed
*
* write(fd, buf, 4096):
* generic_file_write_iter() → generic_perform_write()
* 1. file_update_time() → marks inode mtime/ctime dirty
* 2. a_ops->write_begin() → find or create the page in the page cache at the file offset
* 3. copy_page_from_iter_atomic() → copy userspace buf into the page
* 4. a_ops->write_end() → marks the page dirty for writeback
* 5. Later: per-device flusher threads (pdflush was removed in 2.6.32) write dirty pages via ->writepages
*
* Key benefit: hot data served from RAM, zero disk I/O
* Key cost: extra memory copies, page allocation on first write, cache pressure
*/
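The read and write paths above reduce to a small state machine: look up a page by index, fill it from the device on a miss, mark it dirty on write, and flush later. The toy model below illustrates that flow only; it assumes a fixed-size backing "device" made of whole pages and has no eviction, locking, or readahead. `ToyPageCache` is a hypothetical class, not kernel code.

```python
PAGE_SIZE = 4096

class ToyPageCache:
    def __init__(self, backing: bytearray):
        self.backing = backing      # stands in for the block device
        self.pages = {}             # page index -> bytearray (the cache)
        self.dirty = set()          # indices awaiting writeback
        self.device_reads = 0

    def _get_page(self, index: int) -> bytearray:
        page = self.pages.get(index)
        if page is None:            # miss: read the whole page from the device
            start = index * PAGE_SIZE
            page = bytearray(self.backing[start:start + PAGE_SIZE])
            self.pages[index] = page
            self.device_reads += 1
        return page

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        while length > 0:
            index, within = divmod(offset, PAGE_SIZE)
            chunk = min(length, PAGE_SIZE - within)
            out += self._get_page(index)[within:within + chunk]
            offset += chunk
            length -= chunk
        return bytes(out)

    def write(self, offset: int, data: bytes) -> None:
        pos = 0
        while pos < len(data):
            index, within = divmod(offset + pos, PAGE_SIZE)
            chunk = min(len(data) - pos, PAGE_SIZE - within)
            self._get_page(index)[within:within + chunk] = data[pos:pos + chunk]
            self.dirty.add(index)   # data is only in RAM until writeback
            pos += chunk

    def writeback(self) -> None:
        for index in sorted(self.dirty):
            start = index * PAGE_SIZE
            self.backing[start:start + PAGE_SIZE] = self.pages[index]
        self.dirty.clear()

disk = bytearray(b"A" * (2 * PAGE_SIZE))
cache = ToyPageCache(disk)
cache.read(0, 4)
cache.read(0, 4)                    # second read is a cache hit
cache.write(4094, b"XYZ")           # spans pages 0 and 1
```

After the write, the device still holds the old bytes until `writeback()` runs, which is the window that `fsync()` exists to close.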
/* Direct I/O (O_DIRECT):
* - Bypasses the page cache completely
* - bio submitted directly to the block layer → device
* - Alignment requirements (Linux 6.1+ reports them via statx() with STATX_DIOALIGN):
* - buf must be aligned to logical block size (typically 512 or 4096 bytes)
* - offset must be aligned to logical block size
* - size must be a multiple of logical block size
* - Write: data goes directly from userspace buf → block device (bounce buffer used if needed)
* - Read: data goes directly from block device → userspace buf
*
* io_uring + O_DIRECT on regular files:
* - O_DIRECT support remains filesystem-specific; where it is unsupported, open() or the I/O fails with EINVAL
* - io_uring has accepted O_DIRECT file descriptors since it was introduced in Linux 5.1
* - Registered (fixed) buffers: user buffers pinned and mapped once at registration
* → eliminates per-I/O page pinning and mapping
*
* When to use O_DIRECT:
* - Databases that manage their own cache (e.g. MySQL InnoDB with innodb_flush_method=O_DIRECT)
* - Applications that do large streaming I/O and don't benefit from page cache
* - Avoiding double-caching: app cache + kernel page cache both hold same data
*
* When NOT to use O_DIRECT:
* - Workloads that rely on kernel readahead or write coalescing (O_DIRECT gets neither)
* - Frequently re-read data (page cache would serve it faster than disk)
* - Write-heavy workloads with small updates (CoW filesystems fragment badly)
*
* Example PostgreSQL: effective_io_concurrency = 2 (or 4)
* → PostgreSQL uses buffered I/O for table files by default and relies on the page cache
* → Direct I/O is experimental (debug_io_direct, PostgreSQL 16+)
* → Set effective_io_concurrency based on device queue depth / number of spindles
*
* Example MySQL InnoDB: innodb_flush_method = O_DIRECT
* → Data files are opened with O_DIRECT, so pages are cached only in the InnoDB buffer pool
* → InnoDB still calls fsync to flush file metadata
*
* # Verify O_DIRECT works on a filesystem:
* strace -e trace=open,openat,read,write ./a.out 2>&1 | grep O_DIRECT
* openat(AT_FDCWD, "/data/db", O_RDWR|O_DIRECT) = 4
* # If open succeeds without EINVAL, O_DIRECT is supported
*/
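The three alignment rules above are easy to get wrong, so applications usually check or round requests before submitting them. A minimal sketch, assuming a 4096-byte logical block size by default; both function names are hypothetical.

```python
def dio_aligned(buf_addr: int, offset: int, length: int, block_size: int = 4096) -> bool:
    """True if buffer address, file offset, and length all satisfy O_DIRECT alignment."""
    return length > 0 and all(v % block_size == 0 for v in (buf_addr, offset, length))

def align_range(offset: int, length: int, block_size: int = 4096) -> tuple[int, int]:
    """Expand [offset, offset+length) to the enclosing block-aligned (offset, length)."""
    start = offset - offset % block_size
    end = -(-(offset + length) // block_size) * block_size  # round up
    return start, end - start

# A 5000-byte read at offset 100 must be issued as an 8192-byte read at offset 0.
print(align_range(100, 5000))
print(dio_aligned(buf_addr=0x7F0000001000, offset=8192, length=4096))
```

The application then copies the bytes it actually wanted out of the larger aligned buffer. In Python, an anonymous `mmap.mmap(-1, size)` is one way to obtain a page-aligned buffer.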
/* io_uring with registered (fixed) buffers and files (liburing):
* struct io_uring ring;
* io_uring_queue_init(32, &ring, 0);
*
* // O_DIRECT needs an aligned buffer:
* void *buf;
* posix_memalign(&buf, 4096, 4096);
*
* // Register the buffer once (avoids pinning its pages on each I/O):
* struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
* io_uring_register_buffers(&ring, &iov, 1);
*
* // Register the file so SQEs can refer to it by index:
* int fd = open(path, O_RDWR | O_DIRECT);
* io_uring_register_files(&ring, &fd, 1);
*
* // Submit read via fixed buffer 0 on registered file 0:
* struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
* io_uring_prep_read_fixed(sqe, 0, buf, 4096, 0, 0); // fd argument is the registered index
* sqe->flags |= IOSQE_FIXED_FILE;
* io_uring_submit(&ring);
*
* // Reap the completion:
* struct io_uring_cqe *cqe;
* io_uring_wait_cqe(&ring, &cqe); // cqe->res = bytes read or -errno
* io_uring_cqe_seen(&ring, cqe);
*
* Latency: roughly 1-2 μs of software overhead per I/O; NVMe device latency comes on top
* Throughput: 1M+ IOPS on fast NVMe with large queue depth
*/
Comparison Matrix
| Feature | ext4 | XFS | Btrfs | ZFS |
|---|---|---|---|---|
| Layout | Block groups + extents | Allocation groups + extents | CoW b-trees | CoW with variable records |
| Snapshots | No (use LVM) | No (use LVM) | Built-in, O(1) | Built-in, O(1) |
| Checksums | Metadata only | Metadata only | Data + metadata | Data + metadata |
| Built-in RAID | No | No | Yes (RAID 0/1/10; 5/6 unstable) | Yes (mirror, raidz) |
| Shrink | Yes (offline) | No | Yes | No |
| Best for | General purpose, boot | Multi-TB, parallel workloads | Snapshots, single-host | Storage server, archive |
| Max filesystem | 1 EiB | 8 EiB | 16 EiB | 2^128 bytes (theoretical pool limit) |
| Max file size | 16 TiB (4K blocks) | 8 EiB | 16 EiB | 16 EiB |
| Compression | No | No | zlib, lzo, zstd (inline) | lz4, zstd, gzip |
| Deduplication | No | No | No (ext. tools) | Yes (block-level) |
| Encryption | Yes (via fscrypt) | No | No | Yes (native) |
Tradeoffs
- One syscall API works on every filesystem and remote backend
- Page cache, dcache, and inode cache are shared infrastructure
- You can switch filesystems without changing applications
- FUSE lets you prototype filesystems safely
- Block layer handles scheduling uniformly across devices
- FUSE adds 2-10x overhead from extra context switches
- Each filesystem has unique mount options and tunables
- CoW filesystems fragment under random writes (databases hate them)
- fsync semantics differ across filesystems despite POSIX claims
- Page cache and dentry cache reclaim can cause unexpected stalls under memory pressure
Frequently Asked Questions
What is the VFS and why does it exist?
The Virtual File System is a kernel layer that defines a uniform interface for all filesystems: open, read, write, stat, mmap, etc. Each concrete filesystem (ext4, XFS, Btrfs, NFS, FUSE) implements a struct of function pointers and registers it with the VFS. When you call read() on an FD, the syscall handler looks up the FD's struct file, follows file->f_op->read_iter, and dispatches to the right filesystem. This is why /etc/hosts on ext4 and a remote NFS file behave the same to user programs: they share the VFS layer above. The VFS also caches dentries and inodes globally, so path resolution skips most filesystem-specific work for hot paths.
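The function-pointer dispatch described above can be modelled in a few lines. This is a toy illustration of the pattern, not kernel code: the two "filesystems", the `File` fields beyond `f_op` and `f_pos`, and the sample content are all made up.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FileOperations:
    read_iter: Callable[["File", int], bytes]

@dataclass
class File:
    f_op: FileOperations   # which filesystem's methods serve this file
    data: bytes = b""      # toy stand-in for on-disk content
    f_pos: int = 0

def _read_from(file: File, source: bytes, count: int) -> bytes:
    chunk = source[file.f_pos:file.f_pos + count]
    file.f_pos += len(chunk)
    return chunk

def diskfs_read_iter(file: File, count: int) -> bytes:
    return _read_from(file, file.data, count)            # content stored in the file

def procfs_read_iter(file: File, count: int) -> bytes:
    return _read_from(file, b"MemTotal: 16384 kB\n", count)  # content generated on demand

DISKFS_FOPS = FileOperations(read_iter=diskfs_read_iter)
PROCFS_FOPS = FileOperations(read_iter=procfs_read_iter)

def vfs_read(file: File, count: int) -> bytes:
    """The generic layer never knows which filesystem it is calling."""
    return file.f_op.read_iter(file, count)

hosts = File(DISKFS_FOPS, data=b"127.0.0.1 localhost\n")
meminfo = File(PROCFS_FOPS)
print(vfs_read(hosts, 9), vfs_read(meminfo, 8))
```

The caller uses one `vfs_read()` for both files; only the table behind `f_op` differs.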
What's the difference between ext4 extents and Btrfs CoW?
Ext4 stores file data in extents — contiguous runs of blocks described by (logical_offset, physical_block, length). Overwriting a block updates the data in place; the extent map doesn't change. This is fast and simple. Btrfs is copy-on-write: writing a block allocates a new physical block, writes the new data there, then updates the metadata tree to point at the new location. The old block is reachable from snapshots and freed only when no snapshot references it. This makes snapshots O(1) and gives you bit-rot detection (every block has a checksum), at the cost of write amplification and fragmentation under random-write workloads.
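The two models can be contrasted in a short sketch. The extent lookup mirrors the (logical_offset, physical_block, length) triple described above; the CoW part is deliberately simplified, copying a whole block map where a real CoW filesystem copies only the tree path to the changed block. All names here are hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional

class Extent(NamedTuple):
    logical: int    # first logical block covered
    physical: int   # first physical block on disk
    length: int     # number of contiguous blocks

def lookup(extents: list[Extent], block: int) -> Optional[int]:
    """Map a logical block to a physical block; `extents` is sorted by logical start."""
    i = bisect_right([e.logical for e in extents], block) - 1
    if i < 0:
        return None
    e = extents[i]
    if block >= e.logical + e.length:
        return None                      # hole between extents
    return e.physical + (block - e.logical)

def cow_overwrite(block_map: dict[int, int], block: int, new_physical: int) -> dict[int, int]:
    """Return a new map with `block` redirected; the old map is left untouched."""
    updated = dict(block_map)
    updated[block] = new_physical
    return updated

extents = [Extent(0, 1000, 8), Extent(16, 5000, 4)]
print(lookup(extents, 3), lookup(extents, 8), lookup(extents, 17))

live = {0: 1000, 1: 1001}
snapshot = live                          # a snapshot is just a reference to the old root
live = cow_overwrite(live, 1, 2000)      # overwrite goes to a fresh physical block
print(snapshot[1], live[1])
```

An in-place overwrite on ext4 would leave the extent list unchanged, whereas the CoW overwrite leaves the snapshot pointing at the old block and the live map at the new one.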
When should I use XFS over ext4?
XFS scales better for very large filesystems (multi-TB to PB) and parallel workloads. Its allocation-group design lets multiple CPUs allocate space concurrently without contention; ext4's block-group locking is more serialized. XFS handles metadata-heavy workloads (millions of small files, fsync-heavy databases) noticeably better. The default on RHEL since 7 is XFS for that reason. ext4 wins on small filesystems (<100GB), boot partitions, and anywhere you might need to shrink (XFS only grows, never shrinks).
What does FUSE actually do?
FUSE (Filesystem in Userspace) is a kernel module that proxies VFS calls to a userspace daemon. The daemon opens the /dev/fuse character device, and the mount routes VFS operations for that filesystem to whichever process holds that descriptor. The userspace daemon implements lookup, getattr, read, write, etc., as callbacks. Examples: sshfs, encfs, gocryptfs, GlusterFS clients. Cost: every operation crosses from the calling process into the kernel, out to the daemon, and back again, so FUSE is typically 2-10x slower than native filesystems. Benefit: you can implement a filesystem in Python, and a bug crashes the daemon rather than the kernel.
Why is tmpfs not just a ramdisk?
A ramdisk is a fixed-size block device backed by RAM, formatted with a normal filesystem (ext4 etc.). tmpfs is a filesystem that stores its data directly in the page cache and swap. It grows and shrinks dynamically, doesn't need a block device, and pages can be swapped out under memory pressure (unlike a ramdisk's pinned RAM). /dev/shm and /run are tmpfs by default, as is /tmp on distributions that enable systemd's tmp.mount. The tradeoff: tmpfs files exist only in memory and swap; on reboot or umount they vanish.
What is ZFS's ARC and why does it matter?
The Adaptive Replacement Cache is ZFS's own cache, filling the role the page cache plays for other filesystems. Where the Linux page cache uses active and inactive LRU lists, ARC tracks recently used (MRU) and frequently used (MFU) entries, plus ghost lists of recently evicted ones, and adapts the split between MRU and MFU based on the workload. It's much better than plain LRU at keeping hot data in cache when there's a working set plus a streaming scan that would otherwise flush everything. On Linux, ARC memory is allocated by ZFS outside the regular page cache, which is why ZFS is famously RAM-hungry. L2ARC extends it to a fast SSD.
What happens when I fsync a file?
fsync() walks the page cache for the file's inode, gathers all dirty pages, submits writeback for data blocks, waits for all I/O to complete, then commits the file's metadata through the filesystem journal and flushes the device's volatile write cache. The journal ensures that after a crash the filesystem's metadata is consistent: each transaction is either fully replayed or discarded. Without a journal, a crash mid-write could leave the inode pointing at uninitialized blocks. ext4 uses ordered data mode (default): data blocks hit disk before the commit record. XFS uses delayed logging: many metadata changes are batched into a single journal transaction.
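From the application side, the usual way to rely on these guarantees is the write-temp, fsync, rename, fsync-directory pattern. A minimal sketch for Linux; `durable_write` is a hypothetical helper name, and it assumes the filesystem allows fsync on a directory descriptor.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Atomically replace `path` with `data` so a crash leaves the old or new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)   # same directory, so rename stays on one filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()                # drain Python's userspace buffer
            os.fsync(f.fileno())     # force file data and metadata to stable storage
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)             # persist the directory entry change itself
    finally:
        os.close(dir_fd)

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "state.json")
durable_write(target, b'{"version": 1}')
durable_write(target, b'{"version": 2}')
```

The final directory fsync matters because the rename is a change to the directory, not to the file.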
What is the block layer and why should I care?
The block layer sits between the filesystem and the physical device driver. Filesystems issue bios (block I/O descriptors) for their logical block ranges. The block layer queues these bios through an I/O scheduler (or 'elevator') which merges adjacent requests, sorts by physical location, and dispatches them to the device driver. The multi-queue block layer (blk-mq, the only block layer since Linux 5.0) gives each CPU core its own software queue feeding hardware submission queues, eliminating lock contention at the IOPS rates NVMe devices reach. The choice of scheduler (none, mq-deadline, bfq, kyber) dramatically affects latency vs throughput tradeoffs.
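The active scheduler for a device is exposed in sysfs as a space-separated list with the current choice in square brackets. A minimal sketch of reading it; the helper names are hypothetical and the device name in the usage note is an assumption.

```python
def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler marked with [brackets] in a queue/scheduler file."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active scheduler marked")

def available_schedulers(sysfs_text: str) -> list[str]:
    """Return every scheduler the device offers, active or not."""
    return [token.strip("[]") for token in sysfs_text.split()]

sample = "[none] mq-deadline kyber bfq\n"
print(active_scheduler(sample), available_schedulers(sample))
```

On a live system the text would come from a path such as `/sys/block/nvme0n1/queue/scheduler`, with the device name adjusted to match.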
How does io_uring interact with regular files?
io_uring on regular files bypasses the page cache only when the file was opened with O_DIRECT, and O_DIRECT support still depends on the filesystem. The submission queue (SQ) and completion queue (CQ) are ring buffers in memory shared between userspace and kernel, so one io_uring_enter() call can submit and reap many I/Os, and with SQPOLL a kernel thread polls the SQ so submission needs no syscall at all. For buffered I/O through io_uring, the kernel still uses the page cache; the ring replaces the per-operation read/write syscalls, not the caching. io_uring also supports registered buffers (memory regions pinned and mapped once) to avoid per-I/O page pinning.
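The SQ/CQ mechanism is a pair of single-producer, single-consumer rings indexed by free-running head and tail counters. The toy model below shows only that indexing scheme; it is not the io_uring API, has no shared memory or memory barriers, and the "kernel" is just a loop in the same process.

```python
class Ring:
    """Fixed-size ring: the producer advances tail, the consumer advances head."""
    def __init__(self, entries: int):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0
        self.tail = 0

    def push(self, item) -> bool:
        if self.tail - self.head == len(self.slots):
            return False                       # ring full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None                        # ring empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item

sq, cq = Ring(4), Ring(8)

# Application side: queue three reads, no syscall per entry.
for offset in (0, 4096, 8192):
    sq.push({"op": "read", "offset": offset})

# "Kernel" side: drain the SQ and post one completion per request.
while (sqe := sq.pop()) is not None:
    cq.push({"offset": sqe["offset"], "res": 4096})

# Application side: reap completions.
completions = []
while (cqe := cq.pop()) is not None:
    completions.append(cqe["offset"])
print(completions)
```

Because each ring has exactly one writer for head and one for tail, the two sides never need a lock, which is the property the real shared-memory rings rely on.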
What is the relationship between VFS dentries and the page cache?
Dentries and the page cache are separate caches that cooperate closely. When you open() a file, VFS resolves the path through the dentry cache (dcache) to get the inode. The inode's i_mapping points to an address_space, which is the root of that file's page cache tree (an XArray; a radix tree before Linux 4.20). Pages are keyed by file offset, not by dentry. Multiple processes with the same file open share the same inode, the same address_space, and therefore the same page cache pages, even if they reached the file via different paths (hard links). This is why hard-linked files share the same content transparently.
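The hard-link case is easy to observe from userspace: two names, one inode, one set of contents. A minimal sketch, assuming a filesystem that supports hard links; `demo_hard_link` is a hypothetical helper.

```python
import os
import tempfile

def demo_hard_link(directory: str) -> tuple[bool, int, str]:
    """Create a file, hard-link it, write via the link, and read via the original name."""
    a = os.path.join(directory, "a.txt")
    b = os.path.join(directory, "b.txt")
    with open(a, "w") as f:
        f.write("first")
    os.link(a, b)                          # second dentry, same inode
    with open(b, "a") as f:
        f.write(" second")                 # append through the other name
    same_inode = os.stat(a).st_ino == os.stat(b).st_ino
    with open(a) as f:
        content = f.read()
    return same_inode, os.stat(a).st_nlink, content

with tempfile.TemporaryDirectory() as workdir:
    result = demo_hard_link(workdir)
print(result)
```

The write made through `b.txt` is visible through `a.txt` because both names resolve to the same inode and therefore the same cached pages.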