Linux I/O Models

From blocking read() to io_uring rings — every way Linux does I/O

I/O is the Linux kernel's biggest performance lever. Every model — blocking, non-blocking, multiplexed, signal-driven, async — represents a different tradeoff between code simplicity, syscall overhead, and concurrency. Pick wrong and you leave 5-10x throughput on the table.

The story has five chapters: blocking I/O (the default), non-blocking + select/poll (the C10K era), epoll (Linux 2.5, the modern norm), POSIX/native AIO (the disappointment), and io_uring (Linux 5.1, the future). Each one moves the boundary of "what does the kernel do without my asking" further out.

Before diving in, know this: epoll is still the default choice for network servers in 2025. io_uring is the upgrade, but epoll is battle-tested, simpler, and ships everywhere. The benchmark tables below will help you decide.

The Five Models at a Glance

Key Numbers

1024

FD_SETSIZE: select's hard cap

~1M

connections per epoll instance

~150 ns

syscall cost on modern x86_64

5.1

Linux kernel that introduced io_uring

2-3x

io_uring throughput vs epoll on small ops

2002

year epoll merged (Linux 2.5.44)

~10 µs

epoll_wait overhead (O(1) delivery)

~1–5 µs

io_uring CQ polling overhead

~200–500 ns

context switch to kernel (modern x86)

64 B

SQE size in io_uring rings (a default CQE is 16 B)

Syscall Overhead: The Hidden Multiplier

Every I/O model ultimately calls the kernel. The question is how often and what work the kernel does on each call. Syscall overhead is not just the mode switch — it's the kernel's internal locking, refcounting, and validation work that happens before control returns to your code.

What ~150 ns Actually Buys You on x86_64

Approximate CPU cycle breakdown for a simple read() syscall on a warm cache (nanosecond figures assume a ~2 GHz core):

  User → Kernel transition (syscall entry)    ~50 cycles  (~25 ns)
  Argument validation / security check        ~20 cycles  (~10 ns)
  File table lookup (fd → struct file)        ~30 cycles  (~15 ns)
  Permission / capability check               ~15 cycles  (~8 ns)
  Scheduler preemption check                  ~20 cycles  (~10 ns)
  Kernel → User transition (syscall return)   ~50 cycles  (~25 ns)
  ─────────────────────────────────────────────────────────────
  Total per syscall (best case, hot path)     ~185 cycles (~92 ns)

  Add a context switch (thread descheduled):  +2,000–3,000 cycles (~1–1.5 µs)
  Add a cache miss on the file struct:         +50–200 cycles
  Add a cross-NUMA memory access:              +100–300 cycles

Tracing has its own cost: strace intercepts every syscall through ptrace, so running strace -c ./my_server against a busy server slows it down far more than the syscalls themselves do. The subtler cost is structural: with epoll, every operation still needs its own read() or write() call on top of the epoll_wait() that reported readiness, so 1M ops/s means at least 1M syscalls per second. io_uring cuts that by submitting many operations per io_uring_enter() call.
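The per-call figures above vary with CPU model, kernel version, and enabled mitigations, so it is worth measuring on your own machine. The following is a minimal sketch, not a rigorous benchmark: measure_syscall_ns is a hypothetical helper name, and it times syscall(SYS_getpid) on the assumption that this is about the cheapest call that still enters the kernel.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average wall-clock cost of one trivial syscall, in nanoseconds.
 * syscall(SYS_getpid) is used so that every iteration really enters
 * the kernel instead of being answered from user space. */
double measure_syscall_ns(long iterations)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        syscall(SYS_getpid);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling measure_syscall_ns(1000000) and printing the result gives a floor for the entry/exit cost only; a real read() adds the file lookup and data copy on top.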

Call Frequency by Model

Model	Syscalls per connection per event	Example: 10k conns, 1% active
Blocking (1 thread/conn)	1–2 per I/O (read + write)	10,000 threads = 10k kernel objects
select/poll	1 per iteration (select/poll) + 1–2 per active FD	O(N) scan all 10k each iteration
epoll (level-triggered)	1 (epoll_wait) + 1–2 per active FD	100 active × (read+write) = O(1) wait
epoll (edge-triggered)	1 (epoll_wait) + drain-to-EAGAIN per active FD	Same, but requires careful drain loop
io_uring (basic)	1 (submit batch) + 1 (CQ wait/polling)	100 ops submitted in one syscall
io_uring (SQPOLL)	0 for steady-state high-throughput	Zero-syscall steady state

How the Kernel Wakes Your Code: A Deep Dive

How epoll Works Internally

When you call epoll_create1(EPOLL_CLOEXEC), the kernel allocates an eventpoll object and creates an anonymous inode with an associated file. The FD you get back is a reference to that file. Because the interest list and ready list live in that kernel object, not in userspace, an epoll FD inherited across fork() refers to the same instance in parent and child: registrations made by one are visible to the other.

Each registered FD creates a struct epitem in a red-black tree keyed by (struct file pointer, fd). The RB tree gives O(log n) insert, delete, and lookup. When you call epoll_ctl(EPOLL_CTL_ADD), the kernel calls the file's f_op->poll method once. That call does two things: it hooks ep_poll_callback() onto the file's wait queue, and it reports whether the FD is already ready, in which case the item goes straight onto the ready list.

/* Simplified kernel pseudocode for the epoll_wait wake-up path */

1. epoll_wait(epfd, events, maxevents, timeout)
   → ep_poll()
   → if the ready list is non-empty, or timeout == 0: deliver immediately
   → else sleep on the epoll instance's wait queue (ep->wq)

2. When a registered FD becomes ready (e.g., data arrives on a socket):
   a. NIC interrupt fires → driver schedules NAPI → net_rx_action()
   b. TCP receive path queues the data on the socket
   c. TCP calls the socket's sk_data_ready() callback
   d. sk_data_ready() wakes the socket's wait queue
   e. epoll hooked that wait queue with ep_poll_callback() at EPOLL_CTL_ADD time
   f. ep_poll_callback() does:
      - lock(ep->lock)
      - if the epitem is already on the ready list → skip
      - else list_add_tail(&epi->rdllink, &ep->rdllist)
      - wake up ep->wq   ← wakes the thread in epoll_wait

3. epoll_wait wakes up, takes the ready list, re-polls each item,
   and copies the ones that are still ready to the user events[] array
   → level-triggered items that are still ready go back on the ready list
   → returns the count of ready FDs
   → cost: O(number_of_ready_FDs), not O(total_registered_FDs)

The Ready List and Wake-Up Path

The critical invariant: epoll's ready list is draining-based. When you call epoll_wait, the kernel atomically swaps out the ready list and returns it to you. If you drain only some FDs (because you chose to read only some of them), the others stay pending and epoll_wait returns immediately with them next time (for level-triggered) or waits (for edge-triggered — only fires on new transitions).

Level vs Edge: The Kernel's View

Level-Triggered (default)

Kernel state: fd has data present

Your code calls epoll_wait():
  → kernel checks: is fd still readable?
  → yes → return fd in events[]
  → you read() some but not all data
  → you call epoll_wait() again
  → kernel checks: still readable?
  → yes (data still there) → return fd again
  → ... repeat until you drain to EAGAIN

The kernel re-checks on every call. You'll see the same FD on every epoll_wait until it's drained. Safe but noisy.

Edge-Triggered (EPOLLET)

Kernel state: transition from "no data" to "data present"

Your code calls epoll_wait():
  → kernel checks: did fd's state transition?
  → YES (first call, new data arrived) → return fd
  → you read() all available data to EAGAIN
  → you call epoll_wait() again
  → kernel checks: did fd transition again?
  → NO (state didn't change, still readable)
  → skip this fd, wait on wq
  → NEW data arrives: new transition → wake, return fd

The kernel only notifies on state transitions. You must drain to EAGAIN or miss new connections/data.
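The difference between the two modes can be observed directly with a pipe. The sketch below is illustrative only: count_wakeups is a hypothetical helper that writes one byte, deliberately never reads it, and counts how many of three consecutive epoll_wait() calls report the descriptor. A pipe stands in for a socket here on the assumption that readiness semantics are the same for this purpose.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Writes one byte into a pipe, then calls epoll_wait() three times WITHOUT
 * reading the byte. Returns how many of those calls reported the read end,
 * or -1 on setup failure. */
int count_wakeups(int edge_triggered)
{
    int pipe_fds[2];
    if (pipe(pipe_fds) < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(pipe_fds[0]);
        close(pipe_fds[1]);
        return -1;
    }

    struct epoll_event ev = {
        .events  = EPOLLIN | (edge_triggered ? EPOLLET : 0),
        .data.fd = pipe_fds[0],
    };
    int reported = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipe_fds[0], &ev) == 0 &&
        write(pipe_fds[1], "x", 1) == 1) {
        reported = 0;
        for (int i = 0; i < 3; i++) {
            struct epoll_event out;
            /* timeout 0: report what is ready right now, never block */
            if (epoll_wait(epfd, &out, 1, 0) == 1)
                reported++;
        }
    }

    close(epfd);
    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return reported;
}
```

With edge_triggered set to 0 the unread byte is reported on all three calls; with it set to 1 the byte is reported once and the next two calls come back empty, which is exactly the situation where an undrained descriptor goes silent.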

⚠️ The EPOLLET Miss Condition

With edge-triggered epoll on a listening socket, N connections that arrive close together can produce a single EPOLLIN event. If you call accept() only once, the other N-1 connections sit in the accept queue and produce no further wake-ups, because the kernel sees no new transition. The fix: set the listening socket non-blocking and loop on accept() until it returns EAGAIN.

When to Use Level vs Edge

Use case	Recommended mode	Why
Simple echo server, drain-until-EAGAIN	Edge (EPOLLET)	Highest efficiency, exactly one wake-up per batch
Interactive terminal / stdin	Level (default)	Data may arrive incrementally; you want every chance to read
Accept loop for listening socket	Edge (EPOLLET) + loop to EAGAIN	Correct and efficient; without loop, you'll drop connections
Writing to a socket with flow control	Edge (EPOLLOUT | EPOLLET)	Wake on transition to writable; write until EAGAIN; wait for next transition
Debugging event loop logic	Level (default)	Forgiving — you won't miss events due to partial handling

1. Blocking I/O

The simplest model. You call read(), your thread sleeps, the kernel wakes it when data is ready. Code is trivially correct. The cost is one thread per concurrent I/O operation: every call pays the ~100–300 ns syscall round-trip, and every call that actually blocks adds a full context switch (~1–1.5 µs) to put the thread to sleep and another to wake it.

State Machine

                      ┌─────────────────────────────────────┐
                      │         TASK_RUNNING                │
                      │   (thread executing on CPU)          │
                      └──────────────┬──────────────────────┘
                                     │ read(fd, buf, n)
                                     ▼
                      ┌─────────────────────────────────────┐
                      │       TASK_INTERRUPTIBLE            │
                      │   (sleeping, waiting for data)      │
                      │                                     │
                      │   Kernel: page cache hit? ──YES──► │ Return data
                      │              │                      │ (TASK_RUNNING)
                      │              NO                     │
                      │              ▼                      │
                      │   Kernel: data from NIC buffer?     │
                      │              │                      │
                      │              ├─YES─► DMA to RAM ──► │
                      │              │       (scheduled)    │
                      │              NO                     │
                      │              ▼                      │
                      │   (sleep until interrupt fires)     │
                      └──────────────┬──────────────────────┘
                                     │ interrupt (NIC / disk)
                                     ▼
                      ┌─────────────────────────────────────┐
                      │        TASK_RUNNING                 │
                      │   (woken, syscall returns n>0)      │
                      └─────────────────────────────────────┘

Context Switch Overhead

Every blocking read involves at minimum two mode switches (user → kernel, kernel → user), plus a context switch whenever the thread has to sleep. With 10,000 threads all blocking on different sockets, you have 10,000 kernel task structures, 10,000 thread stacks, and a scheduler that must wake and context-switch a different thread for almost every packet. This is why the C10K problem was named in 1999: on the commodity hardware of the time, thread-per-connection servers struggled beyond roughly 1,000–10,000 concurrent connections.

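/* Classic blocking echo server: one thread per connection */
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

void *handle_client(void *arg) {
    int fd = *(int *)arg;
    free(arg);                       /* heap copy made by the accept loop */
    char buf[1024];
    while (1) {
        ssize_t n = read(fd, buf, sizeof(buf));  /* blocks here */
        if (n <= 0) break;
        if (write(fd, buf, n) != n) break;       /* short write or error */
    }
    close(fd);
    return NULL;
}

int main() {
    int listen_fd = socket(...);
    bind(listen_fd, ...);
    listen(listen_fd, 128);

    while (1) {
        int conn_fd = accept(listen_fd, NULL, NULL);  /* blocks */
        if (conn_fd < 0) continue;
        /* Give each thread its own copy: passing &conn_fd would race
           with the next accept() overwriting the variable. */
        int *arg = malloc(sizeof(*arg));
        if (!arg) { close(conn_fd); continue; }
        *arg = conn_fd;
        pthread_t t;
        if (pthread_create(&t, NULL, handle_client, arg) != 0) {
            close(conn_fd);
            free(arg);
            continue;
        }
        pthread_detach(t);           /* thread releases its own resources */
    }
}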

Spawning a thread per connection is not naive — it's a valid strategy up to ~5,000–10,000 connections on modern hardware with enough RAM. Go's goroutines and Java virtual threads make this approach ergonomic even at higher concurrency, with the runtime doing the epoll multiplexing underneath.

2. Non-blocking I/O + select/poll

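Set a file descriptor to non-blocking mode with O_NONBLOCK. (O_NDELAY is a legacy synonym: on Linux it has the same value as O_NONBLOCK, but on some older System V systems a read with no data returned 0, indistinguishable from EOF, so prefer O_NONBLOCK.) A read() on a non-blocking socket with no data available returns -1 with errno = EAGAIN (or EWOULDBLOCK, the same value on Linux). The naive approach is to loop over every FD, retrying until one returns data.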

EAGAIN is not an error

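/* sock_fd: an existing socket or pipe. Regular files and /dev/null never
   return EAGAIN, so O_NONBLOCK matters for descriptors such as sockets,
   pipes, and terminals. */
int flags = fcntl(sock_fd, F_GETFL, 0);
fcntl(sock_fd, F_SETFL, flags | O_NONBLOCK);
char buf[4096];

ssize_t n = read(sock_fd, buf, sizeof(buf));
if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    /* No data right now. Poll or retry later. */
} else if (n < 0) {
    /* Real error */
} else if (n == 0) {
    /* Peer closed the connection (EOF) */
}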
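To see EAGAIN in practice without a network peer, an empty pipe is enough. This is a small sketch under that assumption; set_nonblocking and errno_from_empty_pipe_read are hypothetical helper names, not part of any library.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Adds O_NONBLOCK to an open descriptor; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Reads from an empty non-blocking pipe whose write end is still open and
 * returns the resulting errno (0 if the read unexpectedly succeeded,
 * -1 on setup failure). */
int errno_from_empty_pipe_read(void)
{
    int pipe_fds[2];
    if (pipe(pipe_fds) < 0)
        return -1;

    int result = -1;
    if (set_nonblocking(pipe_fds[0]) == 0) {
        char byte;
        ssize_t n = read(pipe_fds[0], &byte, 1);
        result = (n < 0) ? errno : 0;   /* save errno before close() */
    }
    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return result;
}
```

The write end is kept open on purpose: if it were closed first, read() would return 0 (EOF) instead of failing with EAGAIN.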

select(2) and poll(2)

select(2) and poll(2) are the wait functions. Both block until at least one of the given file descriptors is ready for I/O. They return the count of ready FDs, but you must scan all of them to find which ones.

/* select: fixed-size bitmap, FD_SETSIZE-limited (often 1024) */
fd_set read_fds, master;
FD_ZERO(&master);
FD_SET(fd, &master);

while (1) {
    read_fds = master;                        /* copy: select clobbers */
    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };
    int ready = select(max_fd + 1, &read_fds, NULL, NULL, &tv);
    if (ready < 0) break;
    for (int fd = 0; fd <= max_fd; fd++) {
        if (FD_ISSET(fd, &read_fds)) {
            /* fd is ready */
            handle(fd);
        }
    }
}

/* poll: array of pollfd structs, no hard FD limit */
struct pollfd fds[16];
fds[0].fd = listen_fd;
fds[0].events = POLLIN;       /* want to read (accept) */
fds[1].fd = conn_fd;
fds[1].events = POLLIN;       /* want to read from connection */

while (1) {
    int n = poll(fds, 2, 5000);   /* timeout in ms, -1 = block */
    if (n < 0) break;
    if (n == 0) continue;        /* timeout */
    for (int i = 0; i < 2; i++) {
        if (fds[i].revents & (POLLIN | POLLHUP | POLLERR)) {
            handle(fds[i].fd);
        }
    }
}

The O(N) Scalability Problem

select/poll: O(N) per call

fd_set read_fds = master;     /* copy ~1024-bit bitmap to kernel */
select(max_fd+1, &read_fds, ...);
/* Kernel walks ALL bits, even FDs with no interest.
   User scans ALL bits to find which are set. O(N) always. */

struct pollfd fds[N];            /* pollfd array copy to kernel */
poll(fds, N, timeout);
/* Kernel walks ALL pollfd entries checking each .fd and .events.
   User scans ALL entries checking .revents. O(N) always. */

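With 100,000 idle connections, select() or poll() walks all 100,000 FDs on every call, even if only one has data. (select() cannot even represent that many descriptors without raising FD_SETSIZE, so in practice this means poll().) At 10,000 req/s, with one poll() call per request, that is up to 10,000 × 100,000 = 10^9 descriptor checks per second, nearly all of them on empty sockets. This is the core scalability failure that epoll solves.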
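The scan cost can be made concrete on a small scale. The sketch below is an illustration, not a benchmark: entries_inspected is a hypothetical helper that opens n pipes, makes only the last one readable (the worst case for a linear scan), and counts how many entries user space has to look at with poll() versus epoll. Pipes are used instead of sockets purely for convenience.

```c
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_PIPES 128

/* Opens n pipes (1 <= n <= MAX_PIPES), makes only the last one readable,
 * and returns how many entries user space inspected to find it.
 * Returns -1 on setup failure. */
int entries_inspected(int n, int use_epoll)
{
    int rd[MAX_PIPES], wr[MAX_PIPES];
    int opened = 0, inspected = -1, epfd = -1;

    if (n < 1 || n > MAX_PIPES)
        return -1;
    for (; opened < n; opened++) {
        int p[2];
        if (pipe(p) < 0)
            goto out;
        rd[opened] = p[0];
        wr[opened] = p[1];
    }
    if (write(wr[n - 1], "x", 1) != 1)
        goto out;

    if (use_epoll) {
        struct epoll_event events[MAX_PIPES];
        epfd = epoll_create1(0);
        if (epfd < 0)
            goto out;
        for (int i = 0; i < n; i++) {
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = rd[i] };
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, rd[i], &ev) < 0)
                goto out;
        }
        /* epoll_wait fills events[] with ready FDs only, so its return
         * value is exactly the number of entries to look at. */
        inspected = epoll_wait(epfd, events, MAX_PIPES, 0);
    } else {
        struct pollfd pfds[MAX_PIPES];
        for (int i = 0; i < n; i++) {
            pfds[i].fd = rd[i];
            pfds[i].events = POLLIN;
            pfds[i].revents = 0;
        }
        int ready = poll(pfds, (nfds_t)n, 0);
        if (ready < 0)
            goto out;
        /* poll() returns only a count: scan until all ready FDs are found. */
        inspected = 0;
        for (int i = 0; i < n && ready > 0; i++) {
            inspected++;
            if (pfds[i].revents & POLLIN)
                ready--;
        }
    }
out:
    if (epfd >= 0)
        close(epfd);
    for (int i = 0; i < opened; i++) {
        close(rd[i]);
        close(wr[i]);
    }
    return inspected;
}
```

With 64 pipes, the poll() path inspects all 64 entries to find the single ready one, while the epoll path inspects one. The kernel-side work follows the same pattern, which is where the real cost lies at 100,000 descriptors.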

3. I/O Multiplexing: epoll(7)

epoll (Linux 2.5.44, 2002) moves the FD set into the kernel. You create an epoll instance, register FDs you care about with epoll_ctl, then call epoll_wait which returns only the ready FDs. Amortized O(1) per event — the kernel only tells you about FDs that actually have activity.

Three System Calls: create, control, wait

/*
 * Step 1: Create an epoll instance.
 * epoll_create1(flags) — prefer this over epoll_create().
 * EPOLL_CLOEXEC: close(epfd) on execve() prevents FD leak to children.
 */
int ep = epoll_create1(EPOLL_CLOEXEC);
if (ep < 0) { perror("epoll_create1"); exit(1); }

/*
 * Step 2: Register FDs of interest with epoll_ctl.
 * EPOLL_CTL_ADD, EPOLL_CTL_MOD, EPOLL_CTL_DEL.
 *
 * events is a bitmask:
 *   EPOLLIN   — ready for read
 *   EPOLLOUT  — ready for write
 *   EPOLLET   — edge-triggered mode (see below)
 *   EPOLLONESHOT: FD is disabled after one event until re-armed with EPOLL_CTL_MOD
 *   EPOLLRDHUP — peer closed connection (Linux 2.6.17+)
 */
struct epoll_event ev = {
    .events  = EPOLLIN | EPOLLET,   /* read, edge-triggered */
    .data.fd = listen_fd,
};
epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

/* Connections returned by accept() on listen_fd are registered the same way */
struct epoll_event ev_conn = {
    .events  = EPOLLIN | EPOLLET,
    .data.fd = conn_fd,
};
epoll_ctl(ep, EPOLL_CTL_ADD, conn_fd, &ev_conn);

/*
 * Step 3: Wait for events.
 * epoll_wait returns n >= 0 (number of ready FDs), or -1 on error.
 * The events[] array is populated with only the ready FDs — no scanning needed.
 */
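struct epoll_event events[64];
while (1) {
    int n = epoll_wait(ep, events, 64, -1);  /* -1 = infinite timeout */
    if (n < 0) { if (errno == EINTR) continue; break; }

    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        uint32_t e = events[i].events;

        if (fd == listen_fd && (e & EPOLLIN)) {
            /* Accept all pending connections. listen_fd must be O_NONBLOCK,
               or the final accept would block instead of returning EAGAIN. */
            while (1) {
                /* accept4 (needs _GNU_SOURCE) returns a non-blocking socket,
                   which the drain loop below depends on */
                int cfd = accept4(listen_fd, NULL, NULL,
                                  SOCK_NONBLOCK | SOCK_CLOEXEC);
                if (cfd < 0) break;          /* EAGAIN: queue is empty */
                struct epoll_event cev = {
                    .events = EPOLLIN | EPOLLRDHUP | EPOLLET, .data.fd = cfd };
                epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
            }
        } else if (e & (EPOLLIN | EPOLLHUP | EPOLLRDHUP | EPOLLERR)) {
            /* Drain the connection until EAGAIN */
            char buf[4096];
            while (1) {
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r > 0) {
                    write(fd, buf, r);       /* echo; a real server must
                                                buffer any unsent bytes */
                    continue;
                }
                if (r < 0 && errno == EINTR) continue;
                if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                    break;                   /* drained: keep the fd open */
                close(fd);                   /* EOF (r == 0) or real error */
                break;
            }
        }
    }
}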

Level-Triggered vs Edge-Triggered (EPOLLET)

Level-Triggered (default)

read() → EAGAIN → re-call read()
 ────────────────────────────────
 ready state (data present)
        │
        ├──── epoll_wait returns, you read()
        ├──── epoll_wait returns, you read()
        └──── (until drained)

epoll_wait fires every time you call it while data is present. Forgiving but noisier. Equivalent to poll() semantics.

Edge-Triggered (EPOLLET)

ready state (data present)
        │
        └──── epoll_wait fires ONCE
              → must read ALL data
                until EAGAIN
              → only fires again on NEW data
 ────────────────────────────────

epoll_wait fires exactly once per transition (idle → ready). More efficient, but you must drain to EAGAIN or you'll miss events. The standard pattern for high-performance servers.

Thundering Herd and SO_REUSEPORT

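In the early days, when one connection arrived on a listening socket, all processes blocked in accept() would be woken, but only one would get the connection. Linux 2.6 fixed this for blocking accept(), but the problem reappears with epoll if you register the same listening socket in multiple epoll instances: a single incoming connection can make the socket appear ready in every instance, and all but one of the woken workers then get EAGAIN from accept(). (EPOLLEXCLUSIVE, added in Linux 4.5, limits how many waiters are woken.)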

/* DON'T do this — same listen_fd in multiple epoll instances
   causes duplicate wakeups on edge-triggered accept */
int ep1 = epoll_create1(0);
int ep2 = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = listen_fd };
epoll_ctl(ep1, EPOLL_CTL_ADD, listen_fd, &ev);
epoll_ctl(ep2, EPOLL_CTL_ADD, listen_fd, &ev);  /* thundering herd! */

SO_REUSEPORT (Linux 3.9) solves load balancing at the socket level: each worker process gets its own listening socket bound to the same port. The kernel picks a socket for each incoming connection by hashing the connection's address/port 4-tuple, which spreads connections roughly evenly across workers (not strictly round-robin). This eliminates both the thundering herd and the need for a master process/thread that accepts and hands off connections.

/* Each worker process: own listening socket, same port */
int fd = socket(AF_INET, SOCK_STREAM, 0);
int enable = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &enable, sizeof(enable));

struct sockaddr_in addr = {
    .sin_family = AF_INET,
    .sin_port   = htons(8080),
    .sin_addr.s_addr = htonl(INADDR_ANY),
};
bind(fd, (struct sockaddr *)&addr, sizeof(addr));
listen(fd, 128);

/* epoll the listening socket in each worker: no contention,
   kernel spreads connections across workers by connection hash */

EPOLLONESHOT: One Event, Then Re-arm

EPOLLONESHOT tells epoll to disable the FD after one event. You must re-enable it with EPOLL_CTL_MOD after handling. This is useful when a single event requires multiple I/O operations and you don't want intermediate states triggering spurious wakeups.

struct epoll_event ev = {
    .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
    .data.fd = fd,
};
epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

/* In the event loop: */
if (events[i].events & EPOLLIN) {
    ssize_t n = read(fd, buf, sizeof(buf));
    /* Re-enable so we get the next message */
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
        .data.fd = fd,
    };
    epoll_ctl(ep, EPOLL_CTL_MOD, fd, &ev);
}
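The disable-then-re-arm cycle can be checked with a pipe. The following is a sketch for illustration; oneshot_sequence is a hypothetical helper that encodes three epoll_wait() results as a three-digit number, and a pipe again stands in for a socket.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Returns a three-digit code ABC, one digit per epoll_wait() result:
 *   A: after the first write                 (FD is armed)
 *   B: after a second write, before re-arm   (FD disabled by EPOLLONESHOT)
 *   C: after EPOLL_CTL_MOD re-arms the FD    (unread data reported again)
 * Returns -1 on setup failure. */
int oneshot_sequence(void)
{
    int pipe_fds[2];
    if (pipe(pipe_fds) < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(pipe_fds[0]);
        close(pipe_fds[1]);
        return -1;
    }

    struct epoll_event ev = {
        .events  = EPOLLIN | EPOLLONESHOT,
        .data.fd = pipe_fds[0],
    };
    struct epoll_event out;
    int code = -1;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipe_fds[0], &ev) == 0 &&
        write(pipe_fds[1], "a", 1) == 1) {
        int first = epoll_wait(epfd, &out, 1, 0);
        if (write(pipe_fds[1], "b", 1) == 1) {
            int second = epoll_wait(epfd, &out, 1, 0);
            /* Re-arm with the same event mask */
            if (epoll_ctl(epfd, EPOLL_CTL_MOD, pipe_fds[0], &ev) == 0) {
                int third = epoll_wait(epfd, &out, 1, 0);
                code = first * 100 + second * 10 + third;
            }
        }
    }

    close(epfd);
    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return code;
}
```

The expected code is 101: one event, then silence while the FD is disabled even though more data arrived, then an event again as soon as EPOLL_CTL_MOD re-arms it. This is what makes EPOLLONESHOT safe for handing a connection to a worker thread: no second thread can be woken for the same FD until the first one re-arms it.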

3b. kqueue(2): macOS and the BSDs

kqueue/kevent is the BSD equivalent of epoll, available on macOS, FreeBSD, NetBSD, and OpenBSD. Linux does not have kqueue natively (epoll is Linux-specific), though the libkqueue compatibility library emulates it on top of epoll for portability. If you're writing cross-platform server code, you need a kqueue backend alongside the epoll one, or an event library such as libevent or libuv that wraps both.

Unlike epoll's split of epoll_create + epoll_ctl + epoll_wait, kqueue uses a unified kevent() syscall for both registration and waiting. EV_SET() is a convenience macro to build a kevent struct.

/* kqueue is kevent-based */
int kq = kqueue();
if (kq < 0) perror("kqueue");

/* Register the listening socket once, before the loop */
struct kevent change;
EV_SET(&change, listen_fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
if (kevent(kq, &change, 1, NULL, 0, NULL) < 0) perror("kevent (register)");

struct kevent events[64];

/* kevent(kq, changelist, nchanges, eventlist, nevents, timeout) */
while (1) {
    int n = kevent(kq, NULL, 0, events, 64, NULL);  /* NULL timeout = block */
    if (n < 0) { if (errno == EINTR) continue; break; }

    for (int i = 0; i < n; i++) {
        int fd = (int)events[i].ident;
        int filter = events[i].filter;

        if (filter == EVFILT_READ) {
            if (fd == listen_fd) {
                /* accept */
                int cfd = accept(fd, NULL, NULL);
                if (cfd >= 0) {
                    struct kevent ev;
                    EV_SET(&ev, cfd, EVFILT_READ, EV_ADD, 0, 0, NULL);
                    kevent(kq, &ev, 1, NULL, 0, NULL);
                }
            } else {
                /* read from connection */
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) {
                    /* close() also removes the fd's events from the
                       kqueue, so no explicit EV_DELETE is needed */
                    close(fd);
                }
            }
        }
    }
}

kqueue Filter Flags

Filter	What it watches	Return when
`EVFILT_READ`	FD readable	data available / connection arrived / EOF
`EVFILT_WRITE`	FD writable	buffer space available / connection established
`EVFILT_VNODE`	file state changes	read, write, rename, delete, chmod
`EVFILT_PROC`	process events	fork, exec, exit, signal
`EVFILT_SIGNAL`	signal delivery	signal is delivered to process
`EVFILT_TIMER`	interval timers	timeout elapsed (oneshot or repeating)

epoll vs kqueue Comparison

Aspect	epoll	kqueue
Availability	Linux only	macOS, FreeBSD, NetBSD, OpenBSD, DragonFly BSD
Syscall count	3 (create + ctl + wait)	2 (kqueue + kevent for both)
Edge-triggered	EPOLLET flag	EV_CLEAR flag
Oneshot	EPOLLONESHOT	EV_ONESHOT
Wakeup on delete	No (removed silently)	EV_DISABLE (or clear filter)
Per-FD data	epoll_data union (fd/ptr/u32/u64)	udata field (void *); ident holds the FD itself
FD types	Sockets, pipes, eventfd, timerfd, epoll, inotify	All of the above + vnodes, processes, signals, timers
Linux emulation	—	libkqueue wraps epoll

4. Signal-Driven I/O (SIGIO)

The idea: instead of polling, the kernel sends you a SIGIO signal when a FD becomes ready. You set this up with F_SETOWN and F_SETFL (or F_SETSIG to use a real-time signal instead of SIGIO).

/* Register process or process group to receive SIGIO for fd */
int pid = getpid();
if (fcntl(fd, F_SETOWN, pid) < 0) perror("F_SETOWN");

/* Enable signal-driven I/O: O_ASYNC flag */
int flags = fcntl(fd, F_GETFL);
if (fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK) < 0) perror("F_SETFL");

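/* Optional: use a specific real-time signal instead of SIGIO.
   F_SETSIG is Linux-specific: #define _GNU_SOURCE before <fcntl.h> */
if (fcntl(fd, F_SETSIG, SIGRTMIN) < 0) perror("F_SETSIG");

/* Signal handler: must be async-signal-safe */
void handle_sigio(int sig, siginfo_t *info, void *ucontext) {
    /* info->si_fd tells you which FD is ready (filled in only when
       F_SETSIG is used and the handler is installed with SA_SIGINFO) */
    char buf[4096];
    while (read(info->si_fd, buf, sizeof(buf)) > 0) { /* drain */ }
}

/* Install the handler with SA_SIGINFO so it receives the siginfo_t */
struct sigaction sa = { .sa_sigaction = handle_sigio, .sa_flags = SA_SIGINFO };
sigemptyset(&sa.sa_mask);
sigaction(SIGRTMIN, &sa, NULL);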

The practical problems: plain SIGIO carries no information about which FD became ready, so you still need to poll to find out. Standard signals are also not queued: if several FDs become ready before your handler runs, the pending SIGIOs coalesce into one. F_SETSIG with a real-time signal helps (real-time signals are queued, and si_fd identifies the FD), but the queue is finite, and when it overflows the kernel falls back to a plain SIGIO, so you still need a poll-everything recovery path. For these reasons, signal-driven I/O is rarely used in production network servers. It occasionally appears in GUI toolkits for stdin/pipe readiness notification.

5. Asynchronous I/O (AIO)

True asynchronous I/O means: you initiate the operation and get a callback (or poll a completion object) when it's done, without blocking at any point. Linux has two AIO interfaces and both are historically disappointing.

POSIX AIO (glibc): A Thread Pool in Disguise

aio_read(3), aio_write(3), aio_error(3), aio_return(3) are defined by POSIX. On Linux (glibc), they are implemented as a userspace thread pool that calls blocking pread(2)/ pwrite(2). There is no kernel-level async I/O happening.

#include <aio.h>

struct aiocb req = { 0 };
req.aio_fildes  = fd;
req.aio_buf     = buf;
req.aio_nbytes  = 4096;
req.aio_offset  = 0;
req.aio_sigevent.sigev_notify = SIGEV_NONE;  /* poll aio_error() */

/* Initiate async read */
if (aio_read(&req) < 0) { perror("aio_read"); exit(1); }

/* Poll for completion */
while (aio_error(&req) == EINPROGRESS) {
    /* spin — or do other useful work */
    ;
}
ssize_t ret = aio_return(&req);  /* actual bytes read or -1 */

Native Linux AIO (io_setup/io_submit)

The kernel-native AIO interface via io_setup(2), io_submit(2), io_getevents(2) works at the syscall level and can do real kernel-level async I/O, but in practice only for files opened with O_DIRECT (which bypasses the page cache and imposes alignment requirements on buffers and offsets). For regular buffered file I/O, io_submit() can silently block until the operation completes. This limitation made it a poor fit for most applications.

#include <linux/aio_abi.h>
#include <sys/syscall.h>

/* You need your own wrapper since glibc doesn't wrap these */
static int io_setup(unsigned nr_events, aio_context_t *ctx) {
    return syscall(SYS_io_setup, nr_events, ctx);
}
static int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp) {
    return syscall(SYS_io_submit, ctx, nr, iocbpp);
}
static int io_getevents(aio_context_t ctx, long min_nr, long nr,
                        struct io_event *events, struct timespec *timeout) {
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout);
}

aio_context_t ctx = 0;
io_setup(256, &ctx);   /* 256 concurrent events */

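struct iocb cb, *cbs[1];
memset(&cb, 0, sizeof(cb));              /* needs <string.h> */
cb.aio_lio_opcode = IOCB_CMD_PREAD;      /* io_prep_pread() exists only in libaio */
cb.aio_fildes     = fd;                  /* fd should be opened with O_DIRECT */
cb.aio_buf        = (__u64)(uintptr_t)buf;   /* buffer aligned for O_DIRECT */
cb.aio_nbytes     = 4096;
cb.aio_offset     = 0;
cbs[0] = &cb;
io_submit(ctx, 1, cbs);                  /* submit the operation */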

/* Wait for completion */
struct io_event events[8];
int n = io_getevents(ctx, 1, 8, events, NULL);
for (int i = 0; i < n; i++) {
    struct iocb *completed = (struct iocb *)(uintptr_t)events[i].obj;
    /* events[i].res = byte count or negative errno */
}

Both AIO interfaces are dead ends. POSIX AIO is a thread pool. Native Linux AIO only works with O_DIRECT. io_uring is what both interfaces tried to be: a generic, kernel-level, high-performance async I/O interface that works on all FD types.

6. io_uring: The Modern Async I/O Interface

io_uring landed in Linux 5.1 (2019). Its key innovation: two lock-free ring buffers shared between userspace and the kernel, mapped via mmap(). No syscall is needed to enqueue a submission or dequeue a completion — you write/read directly to/from the ring. Only one syscall is needed per batch: io_uring_enter(2) to submit pending SQEs and optionally wait for CQEs.

SQ/CQ Ring Architecture

  Userspace                         Kernel
  ┌────────────────────────────┐    ┌────────────────────────────┐
  │  Submission Queue (SQ)      │    │  SQ processing              │
  │  ┌────────────────────┐    │    │  (read SQE, execute)        │
  │  │ SQE: read(3,...)  │────┼───►│                            │
  │  │ SQE: write(4,...)  │────┼───►│  ...                       │
  │  │ SQE: accept(...)  │────┼───►│                            │
  │  └────────────────────┘    │    └────────────────────────────┘
  │         │                 │
  │         │ mmap            │
  │         ▼                 │
  │  ┌────────────────────┐   │
  │  │ Completion Queue    │◄──┼──── CQE written here
  │  │ (CQ)               │   │     after each op
  │  │  CQE: res=1024     │   │
  │  │  CQE: res=-EINTR   │   │
  │  └────────────────────┘   │
  └────────────────────────────┘

  io_uring_enter(ring_fd, to_submit, min_complete, flags, sig)
        ▲
        │ only syscall needed per batch (SYSCALL)
        │
  Userspace poll CQ → no syscall if CQ has entries

Why io_uring Exists: A Short History of Linux I/O Pain

Unix gave us blocking I/O (simple but unscalable), then select/poll (multiplexed, but an O(N) scan per call), then Linux added epoll (O(1) notification). Each step reduced kernel-userspace boundary crossings. io_uring continues the journey: it takes the per-operation syscall off the hot path. You write SQEs to the ring, call io_uring_enter once per batch (or not at all with SQPOLL), and read results from the completion ring. The SQ/CQ rings are lock-free single-producer, single-consumer queues, so the kernel and userspace can operate concurrently without locking, as long as they respect the ring boundaries.

The SQE: 64 Bytes That Replace a Syscall

Every submission queue entry is 64 bytes. That sounds large, but it fits a complete read/write/accept operation: opcode, flags, fd, buffer pointer (or iovec), offset, user_data tag, and optional auxiliary data (e.g., personality for security context).

/* SQE structure (simplified, from io_uring.h) */
struct io_uring_sqe {
    __u8    opcode;     /* IORING_OP_* */
    __u8    flags;      /* IOSQE_* */
    __u16   ioprio;     /* for I/O priority */
    __s32   fd;         /* file descriptor (or registered-file index) */
    __u64   off;        /* offset for random access I/O */
    __u64   addr;       /* pointer to buffer or iovec */
    __u32   len;        /* buffer length or number of iovecs */
    union {
        __u32   rw_flags;   /* read/write flags (RWF_*, as in preadv2/pwritev2) */
        __u32   fsync_flags;
        __u16   poll_events;
    };
    __u64   user_data;  /* tag returned in CQE for identification */
    /* ... more fields (buf_index, personality, etc.) */
};

The CQE is 16 bytes by default: just user_data, a result code, and flags. (Rings created with IORING_SETUP_CQE32 use 32-byte CQEs.) That's tight: if you submit 256 SQEs and they all complete, you have 256 × 16 = 4 KB of completions sitting in the CQ ring, accessible without any kernel call. You consume them at your leisure, then call io_uring_enter() to submit more.

Submission Queue Head Advance: The Lock-Free Trick

The SQ ring has a head (the kernel consumes from here) and a tail (userspace produces here). Each index has exactly one writer: userspace advances the SQ tail and the kernel advances the SQ head. The CQ mirrors this: the kernel advances the CQ tail and userspace advances the CQ head. Each side only reads the other's index. This strict single-writer separation means no locking is needed, just a store-release when publishing an index and a load-acquire when reading the other side's.
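The head/tail protocol described above is the classic single-producer, single-consumer ring. The sketch below shows the idea with C11 atomics. It is a hypothetical, self-contained ring for illustration only: the struct layout, the names (spsc_ring, ring_push, ring_pop, ring_roundtrip_sum), and the 8-entry size are assumptions and do not match the real io_uring memory layout.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Hypothetical ring for illustration; not the io_uring memory layout. */
struct spsc_ring {
    _Atomic uint32_t head;              /* written only by the consumer */
    _Atomic uint32_t tail;              /* written only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false if the ring is full. */
bool ring_push(struct spsc_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & RING_MASK] = value;
    /* Release: the slot write becomes visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_pop(struct spsc_ring *r, uint64_t *value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *value = r->slots[head & RING_MASK];
    /* Release: the slot read completes before the slot is handed back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Usage: fill the ring, confirm it rejects a ninth entry, then drain it.
 * Returns the sum of the drained values, or 0 if any step misbehaves. */
uint64_t ring_roundtrip_sum(void)
{
    struct spsc_ring ring = { 0 };
    uint64_t value, sum = 0;

    for (uint64_t i = 1; i <= RING_ENTRIES; i++)
        if (!ring_push(&ring, i))
            return 0;
    if (ring_push(&ring, 99))           /* full: must be rejected */
        return 0;
    while (ring_pop(&ring, &value))
        sum += value;
    return sum;
}
```

The indices are free-running 32-bit counters that are masked only when indexing a slot, so tail - head is the fill level even after wraparound. In io_uring the producer of the SQ is your program and the consumer is the kernel; for the CQ the roles swap.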

The io_uring_enter Syscall: Not Just a Wrapper

int io_uring_enter(int ring_fd, unsigned to_submit,
                  unsigned min_complete, unsigned flags, sigset_t *sig);

/* key parameters: */
/* to_submit:   number of SQEs the kernel should pick up from the SQ ring */
/* min_complete: with IORING_ENTER_GETEVENTS, block until at least this many */
/*              CQEs are available (0 = don't block) */
/* flags:       IORING_ENTER_* bitmask */
/*              IORING_ENTER_GETEVENTS  - wait for min_complete completions */
/*              IORING_ENTER_SQ_WAKEUP  - wake the SQPOLL thread */
/*              IORING_ENTER_SQ_WAIT    - wait for space in SQ ring */
/*              IORING_ENTER_EXT_ARG    - extended arguments */

The most common pattern: io_uring_submit(&ring) which calls io_uring_enter(fd, n_submitted, 0, 0, NULL) — submit all pending SQEs, don't block waiting for completions. Then poll the CQ until you have enough completions. At high throughput with SQPOLL, this entire loop runs without a single blocking syscall.

liburing vs Raw Syscalls

You can use io_uring via liburing (a separate userspace library, packaged by most distributions), or via raw syscalls wrapped with syscall(SYS_io_uring_enter, ...). liburing is easier: it hides the ring index math and provides helpers like io_uring_prep_read(), io_uring_submit(), io_uring_wait_cqe(). Knowing the raw ring protocol is still worthwhile for performance-critical paths: underneath the helpers, the hot loop just writes SQEs and advances a tail index.

Basic Read/Write with liburing

#include <liburing.h>

struct io_uring ring;
int ret = io_uring_queue_init(256, &ring, 0);  /* 256-entry rings */
if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); exit(1); }

char buf[4096];
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (!sqe) { fprintf(stderr, "SQ ring is full\n"); exit(1); }
io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
sqe->user_data = 42;  /* identify this operation in the CQE */

/* Submit: io_uring_enter() syscall with:
   - to_submit = number of SQEs to process
   - min_complete = how many CQEs to wait for (0 = don't wait)
   - flags = 0, or IORING_ENTER_SQ_WAKEUP to wake the SQ poll thread */
int submitted = io_uring_submit(&ring);   /* returns number submitted, or -errno */
if (submitted < 0) { fprintf(stderr, "submit: %s\n", strerror(-submitted)); exit(1); }

/* Wait for at least 1 completion */
struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&ring, &cqe);     /* blocks, calls io_uring_enter */
if (ret < 0) { fprintf(stderr, "wait_cqe: %s\n", strerror(-ret)); exit(1); }

/* user_data is a 64-bit field, so print it as unsigned long long */
printf("op %llu returned %d\n", (unsigned long long)cqe->user_data, cqe->res);  /* res = bytes or -errno */
io_uring_cqe_seen(&ring, cqe);  /* mark CQE as consumed */

Fixed Buffers and Registered Files

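The ring entries are small: each SQE is 64 bytes and each CQE is 16 bytes, so a 256-entry SQ plus its default 512-entry CQ totals about 24 KiB and stays cache-resident. But io_uring goes further: fixed buffers let you pre-register memory regions so the kernel pins and maps them once instead of on every operation. Registered files pre-register FDs so each operation skips the per-call file table lookup and reference counting; an SQE then refers to a file by its index in the registered table and sets IOSQE_FIXED_FILE.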

/* Pre-register buffers: the kernel pins and maps them once, not per operation */
struct iovec iov[2];
iov[0].iov_base = buf1;  iov[0].iov_len = 4096;
iov[1].iov_base = buf2;  iov[1].iov_len = 8192;
io_uring_register_buffers(&ring, iov, 2);

/* Reference a registered buffer by index in the SQE */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read_fixed(sqe, fd, buf1, 4096, 0, 0);  /* buf_idx=0 */
io_uring_submit(&ring);

/* Pre-register file descriptors */
int fds[2] = { fd1, fd2 };
io_uring_register_files(&ring, fds, 2);

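/* Use a registered fd: pass its index in fds[] and set IOSQE_FIXED_FILE */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, 0, buf, 4096, 0);            /* 0 = index of fd1 in fds[], not a raw fd */
sqe->flags |= IOSQE_FIXED_FILE;                      /* set after prep, which resets flags */
/* OR combine a registered file with a registered buffer: */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read_fixed(sqe, 1, buf2, 8192, 0, 1);  /* file index 1 (fd2), buf_idx=1 */
sqe->flags |= IOSQE_FIXED_FILE;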
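To put numbers on the ring footprint discussed at the start of this section, the following sketch checks the entry sizes against the kernel UAPI header at compile time and computes the array sizes. The function names are hypothetical, and the CQ size assumes the default of twice the SQ entries.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/io_uring.h>

/* Layout facts from the UAPI header, verified at compile time. */
static_assert(sizeof(struct io_uring_sqe) == 64, "default SQE is 64 bytes");
static_assert(sizeof(struct io_uring_cqe) == 16, "default CQE is 16 bytes");

/* Bytes occupied by the SQE array for a ring with `entries` SQ slots. */
static size_t sqe_array_bytes(unsigned entries)
{
    return (size_t)entries * sizeof(struct io_uring_sqe);
}

/* Bytes occupied by the CQEs, assuming the default CQ of 2 * entries. */
static size_t cqe_array_bytes(unsigned entries)
{
    return (size_t)entries * 2 * sizeof(struct io_uring_cqe);
}
```

For a 256-entry ring this works out to 16 KiB of SQEs and 8 KiB of CQEs, before the small headers that hold the head and tail indices.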

SQPOLL: Kernel-Side Polling (Zero Syscalls)

In normal mode, you call io_uring_enter() to submit SQEs and wait for CQEs. In SQPOLL mode, you create a kernel thread that continuously polls the SQ ring. You just write SQEs to the ring (no syscall) and poll the CQ ring (no syscall). In steady state at high throughput, this eliminates all I/O-related syscalls.

/* Create SQPOLL io_uring instance
   IORING_SETUP_SQPOLL: kernel creates a background thread
   IORING_SETUP_SQ_AFF: pin that thread to a specific CPU */
struct io_uring_params params = { 0 };
params.flags = IORING_SETUP_SQPOLL;
params.sq_thread_idle = 2000;   /* park thread after 2000ms idle */

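int ret = io_uring_queue_init_params(256, &ring, &params);  /* 0 on success, -errno on failure */
/* Now: just write SQEs to the ring, poll the CQ ring. */
/* When the background thread is idle for 2s it sleeps and sets
   IORING_SQ_NEED_WAKEUP in the SQ ring flags; wake it with
   io_uring_enter() and IORING_ENTER_SQ_WAKEUP (io_uring_submit()
   checks the flag and does this for you) */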
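If you drive the rings by hand instead of through liburing, the wake-up decision might look like the sketch below. The name `sqpoll_enter_flags` and its parameter are hypothetical; `sq_flags` is assumed to point at the SQ ring's flags word inside the mmap'ed ring.

```c
#include <assert.h>
#include <stdatomic.h>
#include <linux/io_uring.h>

static_assert(sizeof(_Atomic unsigned) == sizeof(unsigned),
              "the shared flags word is a plain 32-bit value");

/* Call after publishing the new SQ tail. Returns the flags to pass to
   io_uring_enter(), or 0 if the poll thread is awake and no syscall is needed. */
static unsigned sqpoll_enter_flags(_Atomic unsigned *sq_flags)
{
    /* Full fence: our tail store must be ordered before the flags load,
       otherwise we could miss a thread that is just going to sleep. */
    atomic_thread_fence(memory_order_seq_cst);
    unsigned f = atomic_load_explicit(sq_flags, memory_order_relaxed);
    return (f & IORING_SQ_NEED_WAKEUP) ? IORING_ENTER_SQ_WAKEUP : 0;
}
```

A caller would skip io_uring_enter() entirely when the function returns 0, which is the steady-state case at high throughput.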

io_uring Operations (IORING_OP_* )

Category	Operations
Basic I/O	`IORING_OP_READ`, `WRITE`, `READ_FIXED`, `WRITE_FIXED`
Socket I/O	`IORING_OP_RECV`, `SEND`, `RECVMSG`, `SENDMSG`
Accept/Connect	`IORING_OP_ACCEPT`, `CONNECT`
File operations	`IORING_OP_OPENAT`, `CLOSE`, `STATX`, `FADVISE`, `MADVISE`
Synchronization	`IORING_OP_FSYNC` (set `IORING_FSYNC_DATASYNC` for fdatasync behavior), `SYNC_FILE_RANGE`
Timeout	`IORING_OP_TIMEOUT`, `TIMEOUT_REMOVE`, `LINK_TIMEOUT`
Filesystem	`IORING_OP_RENAMEAT`, `UNLINKAT` (Linux 5.11+), `MKDIRAT`, `LINKAT` (Linux 5.15+)
Special	`IORING_OP_NOP`, `ASYNC_CANCEL`, `WAITID` (Linux 6.7+)

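Linked SQEs: Ordered Operation Chains

IOSQE_IO_LINK chains operations: the next SQE doesn't start until the current one completes successfully. You can chain open → read → close, or connect → send → shutdown. If one link fails (a short read or write counts as a failure), the remaining links complete with -ECANCELED. The chain is ordered, not atomic: it gives you sequential dependencies without a round trip to userspace between steps and without any thread coordination.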

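/* Chain: open file → read → close, all sequential, no thread needed.
   A linked op cannot see the fd returned by an earlier op, so the open
   installs the file into a fixed-file slot (direct descriptor, Linux 5.15+)
   and the read and close refer to that slot. */
struct io_uring_sqe *sqe;
io_uring_register_files_sparse(&ring, 1);   /* one empty slot: index 0 */

/* openat into slot 0 */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_openat_direct(sqe, AT_FDCWD, "/etc/hosts", O_RDONLY, 0, 0);
sqe->user_data = 1;
sqe->flags |= IOSQE_IO_LINK;    /* link to next */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, 0, buf, 4096, 0);   /* 0 = fixed-file slot, not a raw fd */
sqe->user_data = 2;
sqe->flags |= IOSQE_FIXED_FILE | IOSQE_IO_LINK;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_close_direct(sqe, 0);         /* release slot 0 */
sqe->user_data = 3;

io_uring_submit(&ring);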
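When a chain finishes you get one CQE per link, and a failed link shows up as one real error followed by cancellations. A small sketch of how the results could be summarized, assuming the `res` values have been collected in submission order (`chain_outcome` and `chain_inspect` are hypothetical names):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct chain_outcome {
    int    failed_index;  /* -1 if every link succeeded */
    int    error;         /* positive errno of the failed link, 0 if none */
    size_t cancelled;     /* later links that completed with -ECANCELED */
};

/* res[i] is cqe->res of the i-th link in the chain. */
static struct chain_outcome chain_inspect(const int *res, size_t n)
{
    struct chain_outcome out = { -1, 0, 0 };
    assert(res != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (res[i] >= 0)
            continue;
        if (out.failed_index < 0) {          /* first negative result */
            out.failed_index = (int)i;
            out.error = -res[i];
        } else if (res[i] == -ECANCELED) {   /* skipped because of it */
            out.cancelled++;
        }
    }
    return out;
}
```

For the open → read → close chain, a missing file would report index 0 with ENOENT and two cancelled links.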

Multi-shot Accept

A multi-shot accept, prepared with io_uring_prep_multishot_accept() (Linux 5.19+), lets one SQE produce multiple CQEs, one per accepted connection. This is exactly what you want for a high-throughput server: submit one "keep accepting" SQE at startup and handle every CQE it produces. Each CQE carries IORING_CQE_F_MORE in cqe->flags while the request is still armed; if the flag is missing, submit a new accept. (IOSQE_ASYNC is unrelated: it only forces a request to run from a worker thread.)

/* Submit one multi-shot accept SQE: produces one CQE per connection */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, SOCK_NONBLOCK);
sqe->user_data = 0xFFFF;     /* tag to identify accepts */

io_uring_submit(&ring);

while (1) {
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);

    if (cqe->user_data == 0xFFFF) {
        /* New connection accepted. cqe->res = fd or -errno. */
        int conn_fd = cqe->res;
        if (conn_fd >= 0) {
            handle_connection(conn_fd);
        }
    } else {
        /* Normal operation completion */
        printf("op %llu completed: %d\n", (unsigned long long)cqe->user_data, cqe->res);
    }
    io_uring_cqe_seen(&ring, cqe);
}
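A multi-shot request can be terminated by the kernel (for example on an error), so a robust loop checks whether it is still armed. A hedged sketch of that check, with `multishot_still_armed` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/io_uring.h>

#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE (1U << 1)   /* fallback for older UAPI headers */
#endif

/* True while the request that produced this CQE will post more CQEs.
   When it returns false for an accept CQE, submit a fresh accept SQE. */
static bool multishot_still_armed(const struct io_uring_cqe *cqe)
{
    assert(cqe != NULL);
    return (cqe->flags & IORING_CQE_F_MORE) != 0;
}
```

In the loop above, the check would go right after handling an accept CQE and before io_uring_cqe_seen().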

io_uring vs epoll: Benchmark Context

The key metric: syscall overhead. epoll returns ready FDs but you still need separate read/write/accept syscalls to act on them. io_uring batches submissions and completions, eliminating per-operation syscall overhead.

Metric	select/poll	epoll	io_uring
Ready-FD notification	O(N) per call	O(1) amortized	O(1) amortized
FD set location	Copied each call	Kernel internal	Kernel internal
Syscalls per event loop iteration	1+ (select/poll + read/write)	1+ (epoll_wait + read/write)	1 (batched submit + CQ poll)
Per-operation overhead	one read/write syscall	one read/write syscall	ring enqueue only
Fixed buffers	No	No	Yes (registered)
Kernel polling (no userspace wakeup)	No	No	Yes (SQPOLL)
Works on regular files	Always reports ready (no real async)	No (epoll_ctl fails with EPERM)	Yes (since 5.1)
Multi-shot accept	No	No	Yes (Linux 5.19+)
Chained operations	No	No	Yes (IOSQE_IO_LINK)
Min kernel version	any	2.5.44 (2002)	5.1 (2019)

In microbenchmarks with 1M small reads/writes per second, io_uring shows 2–5x higher throughput than epoll due to syscall reduction. In real workloads the gap narrows: NIC interrupts, TCP stack processing, and page cache behavior dominate. io_uring's advantage is largest for: direct disk I/O (O_DIRECT, bypassing the page cache), high-frequency small operations, and workloads where the CPU cost of syscalls is measurable.
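The syscall arithmetic behind that comparison can be written down directly. This is a back-of-the-envelope sketch, not a benchmark: the function names are hypothetical, and the per-syscall cost is a parameter you would fill in with a measured value such as the ~150 ns figure quoted earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Readiness model: one epoll_wait(), then one read/write/accept per ready FD. */
static uint64_t epoll_syscalls(uint64_t ready_events)
{
    return 1 + ready_events;
}

/* Completion model: one io_uring_enter() per batch, or none while an
   SQPOLL thread is awake. */
static uint64_t uring_syscalls(int sqpoll)
{
    return sqpoll ? 0 : 1;
}

/* Nanoseconds per second spent on syscall entry/exit alone. */
static uint64_t syscall_ns_per_sec(uint64_t syscalls_per_iter,
                                   uint64_t iters_per_sec,
                                   uint64_t ns_per_syscall)
{
    assert(iters_per_sec == 0 ||
           syscalls_per_iter <= UINT64_MAX / iters_per_sec);  /* no overflow */
    return syscalls_per_iter * iters_per_sec * ns_per_syscall;
}
```

With 32 ready events per iteration, 30,000 iterations per second, and 150 ns per syscall, the epoll loop spends about 148.5 ms of every second in syscall overhead, against 4.5 ms for one batched io_uring_enter() per iteration.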

Evolution Timeline

1991

Linux 0.01 — basic blocking read/write, select(2)

1997

poll(2) added in Linux 2.1.23 — no FD bit limit, array-based

1999

C10K problem identified — select/poll can't scale past ~1k connections

2002

epoll merged Linux 2.5.44 — O(1) notification, kernel-side FD set

2003

FreeBSD 5.1 — kqueue production-ready

2006

Linux 2.6.17 — EPOLLRDHUP added (detect peer shutdown without an extra read); edge-triggered mode (EPOLLET) had been available since epoll was merged

2013

Linux 3.9 — SO_REUSEPORT for thundering herd-free load balancing

2019

io_uring patches posted by Jens Axboe in January; merged in Linux 5.1 with SQ/CQ rings, readv/writev, fsync, and poll

2020

Linux 5.6 — io_uring adds openat, close, statx, read/write, send/recv, fadvise, and madvise opcodes

2022

Linux 5.19 — multi-shot accept; Linux 6.0 — async buffered writes (XFS) and zero-copy send

2025

io_uring ops cover most syscall categories; widely used in databases (PolarDB, MySQL, etc.)

Tradeoffs

When io_uring wins

Million-ops-per-second servers (databases, proxies, KV stores)
Workloads with many small I/Os where syscall overhead dominates
File I/O that needs to be truly async (no thread pool)
Modern userspace runtimes (tokio-uring, monoio, glommio) are built on it; Tokio itself still uses epoll by default
Kernel polling (SQPOLL) for lowest possible latency

When epoll is the right choice

You need to support kernels older than 5.10
Security-conscious environments (io_uring has had multiple CVEs; some hardened distros disable it)
Code simplicity and ecosystem maturity matter more than 30% throughput
Your bottleneck is CPU or network, not syscalls
Cross-platform code (epoll on Linux, kqueue on macOS)

When blocking I/O (threads) wins

Low concurrency (<5,000 connections) — code is 10x simpler
CPU-bound work per connection (encryption, computation) — threads keep CPU busy
Green-threaded languages (Go, Java virtual threads) — the runtime multiplexes efficiently
Debugging is easier with linear, blocking call stacks

Frequently Asked Questions

Why is io_uring faster than epoll?

epoll requires a syscall per readiness check (each epoll_wait returns a batch of ready FDs), and once a FD is ready you still need a syscall to actually do the read/write. io_uring batches operations: you submit dozens of reads/writes by appending entries to the submission queue ring (no syscall), then make a single io_uring_enter() to tell the kernel to process them. Completions appear in the completion queue ring, which you poll without a syscall. For high-throughput servers (millions of ops/sec), the syscall savings alone can reach 30-50%. Plus io_uring runs operations that readiness notification can't model, such as fsync, splice, openat, and reads on regular files, all asynchronously.

Why did POSIX AIO never catch on?

Linux's POSIX AIO (aio_read, aio_write) is implemented in glibc as a thread pool that calls blocking pread/pwrite, so it is not asynchronous in the kernel sense. The kernel's native AIO (io_submit/io_getevents) is only reliably asynchronous for files opened with O_DIRECT; for buffered file I/O, io_submit silently blocks until the operation finishes. This made it unusable for most workloads, so people used thread pools or epoll instead. io_uring is what AIO should have been from day one: a generic asynchronous syscall interface that works on all FD types.

What's the difference between select, poll, and epoll?

select uses a fixed-size bitmap (typically capped at 1024 FDs by FD_SETSIZE) and copies it between userspace and the kernel on each call: O(N) per call, where N is the highest FD number. poll uses an array of pollfd structs without the size limit, but still copies and scans the whole array each call: O(N). epoll keeps the FD set inside the kernel (epoll_ctl adds/removes), so each epoll_wait only returns the ready FDs; its cost scales with the number of ready FDs, not the number registered. For 100,000 idle connections with one active, select cannot even represent the set, poll walks all 100k structs, and epoll returns immediately with the one ready FD.
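A tiny demonstration of the "only the ready FDs come back" behavior, using two pipes in place of sockets. It is a minimal sketch (the function name `epoll_ready_count` is hypothetical): both read ends are registered, only one is made readable, and epoll_wait reports just that one.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns the number of ready FDs reported by epoll_wait(), or -1 on error. */
static int epoll_ready_count(void)
{
    int idle[2], active[2], ready = -1;
    if (pipe(idle) < 0)
        return -1;
    if (pipe(active) < 0) { close(idle[0]); close(idle[1]); return -1; }

    int ep = epoll_create1(0);
    if (ep >= 0) {
        struct epoll_event ev = { .events = EPOLLIN };
        struct epoll_event out[8];

        /* Register both read ends once; the set lives in the kernel. */
        ev.data.fd = idle[0];
        int ok = epoll_ctl(ep, EPOLL_CTL_ADD, idle[0], &ev) == 0;
        ev.data.fd = active[0];
        ok = ok && epoll_ctl(ep, EPOLL_CTL_ADD, active[0], &ev) == 0;

        if (ok && write(active[1], "x", 1) == 1) {
            ready = epoll_wait(ep, out, 8, 0);   /* timeout 0: just poll */
            assert(ready <= 1);                  /* only one pipe was written */
            if (ready == 1 && out[0].data.fd != active[0])
                ready = -1;                      /* wrong FD reported */
        }
        close(ep);
    }
    close(idle[0]); close(idle[1]);
    close(active[0]); close(active[1]);
    return ready;
}
```

The idle pipe never appears in the output array, no matter how many idle FDs are registered alongside it.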

When should I use blocking I/O?

When concurrency is low (a few hundred connections), threads are cheap, and the code is much simpler with synchronous flow. Modern kernels handle thousands of threads fine; the C10K problem is largely solved by epoll/io_uring, but the 'just spawn a thread per connection' approach works fine up to ~10k connections on a beefy box. Languages with cheap green threads (Go, Java virtual threads) make this even more attractive — you write blocking code and the runtime multiplexes onto epoll under the hood.

How does Windows IOCP compare to io_uring?

IOCP (I/O Completion Ports) has been the Windows model since NT 3.5 and looks similar to io_uring on the surface: you start an async operation, the kernel queues a completion to a port when done, you dequeue completions. One difference is API breadth: IOCP has worked on virtually every Windows handle (files, sockets, pipes, named pipes) from day one, while io_uring inherited a fragmented Linux landscape and is only now (2020+) generalizing to all FD types. Another is submission cost: IOCP still takes one call per operation (ReadFile, WSARecv), whereas io_uring batches submissions through a shared ring. Performance-wise they're in the same ballpark.

What is signal-driven I/O and why don't people use it?

Signal-driven I/O (O_ASYNC plus F_SETOWN, delivering SIGIO) tells the kernel to send a signal whenever a FD becomes ready. In theory you do nothing until SIGIO fires, then you read. In practice: signals are global to a process, hard to multiplex, lose information when delivered too fast (standard signals coalesce), and interact badly with locks and async-signal-unsafe functions. epoll dominates because it provides the same edge-triggered semantics without the signal-handling minefield.

What is the thundering herd problem?

Before Linux 2.6, when multiple processes blocked on accept() for the same listening socket, all were woken when a connection arrived, even though only one would get it. This wasted wakeups. Linux 2.6 introduced a fix: only one waiter is woken per connection attempt. However, the problem resurfaces with epoll when several workers each register the same listening socket: a single incoming connection can wake all of them, and all but one get EAGAIN from accept(). EPOLLEXCLUSIVE (Linux 4.5) limits those wakeups, and SO_REUSEPORT (Linux 3.9) solves load-balancing at the socket level by giving each worker its own listening socket, eliminating the accept queue contention entirely.

What operations does io_uring support?

io_uring started in Linux 5.1 with a small set (nop, readv, writev, fsync, read_fixed, write_fixed, poll add/remove) and has grown nearly every kernel release. Linux 5.5 added accept and connect; Linux 5.6 added read, write, send, recv, openat, close, statx, fadvise, and madvise; Linux 5.11 added socket shutdown, renameat, and unlinkat; Linux 5.15 added mkdirat, symlinkat, and linkat. As of 2025 most file and network operations are supported. See /usr/include/linux/io_uring.h for the full list of IORING_OP_* known to your headers, or probe what the running kernel supports with liburing's io_uring_get_probe().