Reddit - r/programming 1h ago

io_uring Feels Illegal

A Visual Walkthrough of How io_uring Works

io_uring is a Linux kernel interface for asynchronous I/O that feels almost illegal in its power and design. At its core, it operates through ring buffers shared between userspace and the kernel, so I/O requests and their results can be exchanged without copying them across the syscall boundary.

Shared Rings: SQEs and CQEs

The fundamental building blocks are two shared ring buffers:

Submission Queue (SQ) - holds Submission Queue Entries (SQEs), which describe I/O operations
Completion Queue (CQ) - holds Completion Queue Events (CQEs), which report results

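These rings live in memory that the kernel allocates and userspace maps with mmap(), so both sides see the same entries. Userspace writes SQEs and the kernel reads them in place; the kernel writes CQEs and userspace reads them in place. Only request descriptions and results travel through the rings; the I/O payload itself stays in the buffers that the SQEs point to.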
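The head/tail mechanics of such a ring can be sketched in plain C11. This is a toy single-producer, single-consumer model inside one process, not the kernel ABI: `toy_ring`, `toy_entry`, `toy_push`, and `toy_pop` are hypothetical names, and the real rings keep their indices at offsets reported by io_uring_setup(). What it does show is the masked indexing and the release/acquire pairing that makes a freshly written entry visible to the other side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* ring size; must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

static_assert((RING_ENTRIES & RING_MASK) == 0, "RING_ENTRIES must be a power of two");

/* Hypothetical stand-in for an SQE: an opcode plus a caller-chosen tag. */
struct toy_entry {
    uint8_t  opcode;
    uint64_t user_data;
};

struct toy_ring {
    _Atomic uint32_t head;              /* advanced only by the consumer */
    _Atomic uint32_t tail;              /* advanced only by the producer */
    struct toy_entry entries[RING_ENTRIES];
};

/* Producer: fill the slot first, then publish it with a release store to
 * tail, so the consumer can never observe a half-written entry. */
static bool toy_push(struct toy_ring *r, struct toy_entry e)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;                   /* ring is full */
    r->entries[tail & RING_MASK] = e;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer: acquire-load tail, copy the entry out, then hand the slot back
 * to the producer by advancing head. */
static bool toy_pop(struct toy_ring *r, struct toy_entry *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                   /* ring is empty */
    *out = r->entries[head & RING_MASK];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

The indices are free-running 32-bit counters that are masked only when used as array offsets, which is why `tail - head` gives the fill level even after the counters wrap.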

Batching Operations

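Instead of issuing one system call per operation, io_uring lets you queue many SQEs and hand them to the kernel with a single io_uring_enter() syscall (liburing wraps this as io_uring_submit()). The cost of the user/kernel transition is paid once per batch instead of once per operation: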

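// Queue 64 operations, then submit them all with one syscall (assumes the ring has at least 64 entries)
for (int i = 0; i < 64; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);  // returns NULL if the SQ is full
    io_uring_prep_nop(sqe);  // stand-in for a real read/write prep call
}
int ret = io_uring_submit(&ring);  // returns the number of SQEs submitted, or -errno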

SQPOLL Mode

For latency-sensitive workloads, SQPOLL mode spawns a kernel thread that polls the submission queue continuously:

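Removes the io_uring_enter() syscall from the submission path for as long as the polling thread is awake
The kernel thread goes to sleep after a configurable idle period (sq_thread_idle, in milliseconds) and sets IORING_SQ_NEED_WAKEUP; userspace must then wake it by calling io_uring_enter() with IORING_ENTER_SQ_WAKEUP
Trades a busy-polling kernel thread for lower submission latency on high-frequency operations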
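When the polling thread goes idle it sets IORING_SQ_NEED_WAKEUP in the SQ ring's flags word, and the application has to notice that before assuming its new SQEs will be picked up. A minimal sketch of that check follows, assuming `sq_flags` points at the flags word inside the mmap'd SQ ring (the offset comes from io_uring_setup()); the helper name `sq_needs_wakeup` is hypothetical, and the fence-then-load pattern follows what liburing does internally as far as I understand it.

```c
#include <assert.h>
#include <linux/io_uring.h>   /* IORING_SQ_NEED_WAKEUP */
#include <stdbool.h>
#include <stddef.h>

/* Call this after publishing a new SQ tail in SQPOLL mode.
 * Returns true when the poller thread is asleep, in which case the caller
 * must invoke io_uring_enter() with IORING_ENTER_SQ_WAKEUP. */
static bool sq_needs_wakeup(const unsigned *sq_flags)
{
    assert(sq_flags != NULL);
    /* Full fence: order the caller's earlier tail store before the flag
     * load, so a poller that is about to sleep cannot be missed. */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    unsigned flags = __atomic_load_n(sq_flags, __ATOMIC_RELAXED);
    return (flags & IORING_SQ_NEED_WAKEUP) != 0;
}
```

With liburing this check is already built into io_uring_submit(), so it only matters for code that drives the rings by hand.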

Multishot Operations

Multishot operations allow a single SQE to produce multiple CQEs:

Example: IORING_OP_ACCEPT can accept multiple connections from one submission
Example: IORING_OP_POLL_ADD can fire repeatedly for recurring events
Each completion has IORING_CQE_F_MORE set in its flags while more events will follow; a CQE without that flag means the operation has ended and must be resubmitted to keep receiving events
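Handling a multishot completion mostly comes down to reading two fields of the CQE. The sketch below assumes the CQE came from a multishot accept; `accept_result` and `on_multishot_accept_cqe` are hypothetical names, while `struct io_uring_cqe` and IORING_CQE_F_MORE come from the kernel's UAPI header.

```c
#include <assert.h>
#include <linux/io_uring.h>   /* struct io_uring_cqe, IORING_CQE_F_MORE */
#include <stdbool.h>
#include <stddef.h>

/* What the caller should do after one CQE from a multishot accept. */
struct accept_result {
    int  client_fd;   /* accepted socket, or -1 if this CQE reported an error */
    int  error;       /* 0 on success, otherwise a positive errno value */
    bool rearm;       /* true when the SQE has ended and must be resubmitted */
};

static struct accept_result on_multishot_accept_cqe(const struct io_uring_cqe *cqe)
{
    assert(cqe != NULL);
    struct accept_result r = { .client_fd = -1, .error = 0, .rearm = false };

    if (cqe->res >= 0)
        r.client_fd = cqe->res;   /* res carries the new connection's fd */
    else
        r.error = -cqe->res;      /* a negative res is -errno */

    /* Without IORING_CQE_F_MORE, no further CQEs will arrive for this SQE. */
    r.rearm = (cqe->flags & IORING_CQE_F_MORE) == 0;
    return r;
}
```

A completion loop would call this for every CQE whose user_data identifies the accept request, and queue a fresh multishot accept whenever `rearm` comes back true.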

Linked Operations

Linked operations create chains where one operation starts only after the previous completes:

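struct io_uring_sqe *sqe1, *sqe2;

// A plain openat reports its fd only in the CQE, too late for a linked read.
// This version assumes a fixed-file table is already registered
// (e.g. via io_uring_register_files_sparse()) so slot 0 can be used instead.

// First operation: open the file directly into fixed-file slot 0
sqe1 = io_uring_get_sqe(&ring);
io_uring_prep_openat_direct(sqe1, dir_fd, path, flags, mode, 0);
sqe1->flags |= IOSQE_IO_LINK;  // the link flag goes on the first SQE: the next one waits for it

// Second operation: read from slot 0 (runs only if the open succeeds)
sqe2 = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe2, 0, buf, size, offset);
sqe2->flags |= IOSQE_FIXED_FILE;  // the fd argument is a fixed-file index, not a regular fd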
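When one request in a chain fails, the requests linked after it are not executed and complete with -ECANCELED instead. A small sketch of how a completion handler might tell those cases apart is shown here; the enum and the function name `classify_linked_cqe` are hypothetical, and note that a request cancelled explicitly by the application reports the same error code.

```c
#include <assert.h>
#include <errno.h>
#include <linux/io_uring.h>   /* struct io_uring_cqe */
#include <stddef.h>

enum link_status {
    LINK_OK,        /* the request ran and succeeded */
    LINK_FAILED,    /* the request ran and failed; later links get cancelled */
    LINK_SKIPPED    /* the request was cancelled, e.g. because an earlier link failed */
};

static enum link_status classify_linked_cqe(const struct io_uring_cqe *cqe)
{
    assert(cqe != NULL);
    if (cqe->res == -ECANCELED)
        return LINK_SKIPPED;
    if (cqe->res < 0)
        return LINK_FAILED;
    return LINK_OK;
}
```

Each SQE in a chain still produces its own CQE, so the handler sees one LINK_FAILED followed by a LINK_SKIPPED for every request that was queued behind it.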

Fixed Buffers and Provided Buffers

Fixed buffers are memory regions registered once up front; the kernel pins and maps their pages at registration time instead of on every I/O:

struct iovec iov[] = { /* ... */ };
io_uring_register_buffers(&ring, iov, 1);

Provided buffers allow the kernel to dynamically select which buffer to use for operations like IORING_OP_RECV:

Userspace donates a pool of buffers
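The kernel picks one when data actually arrives and reports its ID in the CQE, so idle connections do not each tie up a buffer; userspace hands the buffer back to the pool once it has consumed the data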
Particularly useful for network servers handling variable-length messages
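The buffer the kernel selected is reported in the upper bits of the CQE's flags field. The sketch below shows how a completion handler might recover it; `cqe_buffer_id` is a hypothetical helper name, while IORING_CQE_F_BUFFER and IORING_CQE_BUFFER_SHIFT are defined in the kernel's UAPI header.

```c
#include <assert.h>
#include <linux/io_uring.h>   /* IORING_CQE_F_BUFFER, IORING_CQE_BUFFER_SHIFT */
#include <stddef.h>

/* Returns the ID of the provided buffer the kernel chose for this
 * completion, or -1 if the CQE carries no buffer (for example because
 * the receive failed before a buffer was selected). */
static int cqe_buffer_id(const struct io_uring_cqe *cqe)
{
    assert(cqe != NULL);
    if (!(cqe->flags & IORING_CQE_F_BUFFER))
        return -1;
    return (int)(cqe->flags >> IORING_CQE_BUFFER_SHIFT);   /* upper 16 bits */
}
```

For a successful receive, `cqe->res` gives the number of bytes written into that buffer; once the data has been processed, the application is expected to give the buffer back to the kernel so it can be selected again.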

Tradeoffs and Power

The interface exposes raw kernel power with significant tradeoffs:

Memory ordering - userspace must use proper memory barriers when accessing shared rings
Complexity - the API surface is vast, with many flags and modes that interact in subtle ways
Portability - io_uring is Linux-specific and requires kernel 5.1+; several of the features above arrived later (multishot accept, for example, needs 5.19)
Resource management - fixed buffers and registered files consume kernel resources that must be explicitly freed

The design feels illegal because it bypasses decades of Unix I/O conventions. No read(), no write(), no select(), no epoll() - just shared memory and atomic operations between userspace and kernel, operating at speeds that traditional interfaces cannot match.
