io_uring Feels Illegal
A Visual Walkthrough of How io_uring Works
io_uring is a Linux kernel interface for asynchronous I/O that feels almost illegal in its power and design. At its core, it operates through rings shared between userspace and the kernel, so I/O requests and their results are exchanged through shared memory instead of being passed across the syscall boundary one call at a time.
Shared Rings: SQEs and CQEs
The fundamental building blocks are two shared ring buffers:
- Submission Queue (SQ) - holds Submission Queue Entries (SQEs), which describe I/O operations
- Completion Queue (CQ) - holds Completion Queue Entries (CQEs), which report results
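These rings are mapped into the application's address space with mmap(), so userspace and the kernel read and write the same memory. Userspace writes SQEs and reads CQEs; the kernel reads SQEs and writes CQEs. No request or result descriptor has to be copied across the syscall boundary, although the I/O payload itself still travels through the buffers each SQE points to.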
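To make the head/tail protocol concrete, here is a minimal userspace-only model of one such ring. It is a sketch, not the kernel's layout: `struct ring`, `struct entry`, `ring_push`, `ring_pop`, and `RING_ENTRIES` are hypothetical names, and both sides run in the same process. What it shares with the real thing is the idea: a power-of-two ring indexed by free-running counters, where the producer publishes with a release store and the consumer observes with an acquire load.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Hypothetical entry type; a real SQE also carries opcode, fd, address, ... */
struct entry {
    uint64_t user_data;
};

struct ring {
    _Atomic uint32_t head;              /* advanced only by the consumer */
    _Atomic uint32_t tail;              /* advanced only by the producer */
    struct entry slots[RING_ENTRIES];
};

/* Producer: fill the slot first, then publish it with a release store. */
static bool ring_push(struct ring *r, struct entry e)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;                   /* ring is full */
    r->slots[tail & RING_MASK] = e;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer: acquire the tail, copy the slot out, then hand the slot back. */
static bool ring_pop(struct ring *r, struct entry *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                   /* ring is empty */
    *out = r->slots[head & RING_MASK];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

The counters are never reduced modulo the ring size; only the slot index is masked. That lets `tail - head` give the number of occupied slots even after the 32-bit counters wrap.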
Batching Operations
Instead of issuing one system call per operation, an application can queue many SQEs and submit them all with a single io_uring_enter() syscall. This amortizes the cost of the user/kernel transition across the whole batch:
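// Queue 64 operations, then submit them with one syscall
// (assumes `ring` was initialized with at least 64 SQ entries)
for (int i = 0; i < 64; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); // NULL if the SQ is full
    io_uring_prep_nop(sqe); // placeholder; a real program would prep reads, writes, ...
}
int ret = io_uring_submit(&ring); // one io_uring_enter() for all pending SQEs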
SQPOLL Mode
For latency-sensitive workloads, SQPOLL mode spawns a kernel thread that polls the submission queue continuously:
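- While the poller thread is awake, new SQEs are picked up without any io_uring_enter() syscall
- The kernel thread goes to sleep after a configurable idle period and sets IORING_SQ_NEED_WAKEUP in the ring flags; userspace must then wake it by calling io_uring_enter() with IORING_ENTER_SQ_WAKEUP
- Lowers submission latency for high-frequency operations, at the cost of a kernel thread that spins on a CPU while active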
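The wake-up handshake is the one place where SQPOLL still needs a syscall. The check userspace performs after publishing new SQEs can be sketched as below. The two constants mirror the values of IORING_SQ_NEED_WAKEUP and IORING_ENTER_SQ_WAKEUP in `<linux/io_uring.h>` but are redefined under hypothetical names so the snippet stands alone, and `sqpoll_enter_flags` is an illustrative helper, not a liburing function.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Values mirror <linux/io_uring.h>; redefined here so the sketch stands alone. */
#define SQ_NEED_WAKEUP  (1u << 0)   /* IORING_SQ_NEED_WAKEUP  */
#define ENTER_SQ_WAKEUP (1u << 1)   /* IORING_ENTER_SQ_WAKEUP */

/*
 * Call after the new SQ tail has been stored. Returns the flags to pass to
 * io_uring_enter(), or 0 when the poller is awake and no syscall is needed.
 */
static unsigned sqpoll_enter_flags(_Atomic uint32_t *sq_flags)
{
    /*
     * Full fence: the tail store must be visible before the flag is read,
     * otherwise the poller could go to sleep without seeing the new SQEs
     * while userspace still believes it is awake.
     */
    atomic_thread_fence(memory_order_seq_cst);
    uint32_t flags = atomic_load_explicit(sq_flags, memory_order_relaxed);
    return (flags & SQ_NEED_WAKEUP) ? ENTER_SQ_WAKEUP : 0u;
}
```

In the common case the function returns 0 and submission costs nothing beyond the memory writes; only after an idle period does the application pay for one wake-up call.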
Multishot Operations
Multishot operations allow a single SQE to produce multiple CQEs:
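- Example: IORING_OP_ACCEPT in multishot mode can accept multiple connections from one submission
- Example: IORING_OP_POLL_ADD with IORING_POLL_ADD_MULTI can fire repeatedly for recurring events
- Each completion carries the IORING_CQE_F_MORE flag while further CQEs will follow; a CQE without that flag means the request has ended and must be resubmitted to keep receiving events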
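The completion side of a multishot request comes down to checking one bit per CQE. Below is a small sketch of that bookkeeping. `struct cqe` repeats the three fields of the basic `struct io_uring_cqe`, and `CQE_F_MORE` mirrors the value of IORING_CQE_F_MORE; both are redeclared under hypothetical names so the example compiles on its own, and `struct multishot_state` and `on_multishot_cqe` are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define CQE_F_MORE (1u << 1)        /* mirrors IORING_CQE_F_MORE */

/* Same three fields as the basic struct io_uring_cqe. */
struct cqe {
    uint64_t user_data;
    int32_t  res;
    uint32_t flags;
};

/* Hypothetical per-request bookkeeping kept by the application. */
struct multishot_state {
    unsigned completions;           /* CQEs seen for this request so far */
    bool     armed;                 /* false once the kernel retired the SQE */
};

/* Record one completion and note whether the request is still live. */
static void on_multishot_cqe(struct multishot_state *st, const struct cqe *c)
{
    st->completions++;
    if (!(c->flags & CQE_F_MORE))
        st->armed = false;          /* caller must submit a fresh SQE */
}
```

A server loop would call this for every CQE whose `user_data` identifies the multishot request and queue a new SQE as soon as `armed` turns false, for instance after an error terminated the original one.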
Linked Operations
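Linked operations create chains in which each operation starts only after the previous one completes successfully. If a link fails, the remaining operations in the chain are cancelled and complete with -ECANCELED: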
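struct io_uring_sqe *sqe1, *sqe2;
// First operation: open the file into registered-file slot `slot`
// (`slot` is illustrative; assumes a file table was registered beforehand)
sqe1 = io_uring_get_sqe(&ring);
io_uring_prep_openat_direct(sqe1, dir_fd, path, flags, mode, slot);
sqe1->flags |= IOSQE_IO_LINK; // the flag goes on the FIRST SQE and links it to the next one
// Second operation: read from that slot; runs only if the open succeeds
sqe2 = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe2, slot, buf, size, offset);
sqe2->flags |= IOSQE_FIXED_FILE; // treat `slot` as a registered-file index, not a regular fd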
Fixed Buffers and Provided Buffers
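Fixed buffers are memory regions registered with the kernel ahead of time. Their pages are pinned and mapped once at registration, so each operation that uses them (for example IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED) skips the per-I/O page pinning: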
struct iovec iov[] = { /* ... */ };
io_uring_register_buffers(&ring, iov, 1);
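Provided buffers let the kernel choose which buffer to use at the moment an operation such as IORING_OP_RECV actually has data, when the SQE sets IOSQE_BUFFER_SELECT:
- Userspace hands the kernel a pool of buffers under a buffer group ID
- The kernel picks one automatically and reports its buffer ID in the CQE, so userspace does not have to dedicate a buffer to every pending receive (it still has to return buffers to the pool after use)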
- Particularly useful for network servers handling variable-length messages
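When a completion used a provided buffer, the kernel sets IORING_CQE_F_BUFFER in the CQE flags and stores the buffer ID in the upper 16 bits. Decoding that might look like the sketch below. The constants mirror the kernel header values under hypothetical names, and the pool geometry (`BUF_COUNT`, `BUF_SIZE`, one contiguous allocation indexed by buffer ID) is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Values mirror <linux/io_uring.h>; redefined so the sketch stands alone. */
#define CQE_F_BUFFER     (1u << 0)  /* IORING_CQE_F_BUFFER     */
#define CQE_BUFFER_SHIFT 16         /* IORING_CQE_BUFFER_SHIFT */

/* Hypothetical pool geometry: BUF_COUNT buffers of BUF_SIZE bytes each. */
#define BUF_COUNT 4u
#define BUF_SIZE  256u

/* Returns the buffer ID the kernel selected, or -1 if no buffer was used. */
static int cqe_buffer_id(uint32_t cqe_flags)
{
    if (!(cqe_flags & CQE_F_BUFFER))
        return -1;
    return (int)(cqe_flags >> CQE_BUFFER_SHIFT);
}

/* Maps a completion back to its slice of a contiguous pool, or NULL. */
static unsigned char *cqe_buffer(unsigned char *pool, uint32_t cqe_flags)
{
    int id = cqe_buffer_id(cqe_flags);
    if (id < 0 || (unsigned)id >= BUF_COUNT)
        return NULL;
    return pool + (size_t)id * BUF_SIZE;
}
```

After processing the data, the application hands the buffer back to the kernel so it can be selected again; that step is omitted here.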
Tradeoffs and Power
The interface exposes raw kernel power with significant tradeoffs:
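- Memory ordering - userspace must use proper memory barriers when accessing the shared rings; liburing's helpers take care of this, but code that talks to the raw interface has to get it right itself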
- Complexity - the API surface is vast, with many flags and modes that interact in subtle ways
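- Portability - io_uring is Linux-specific and requires kernel 5.1+, and newer features such as multishot operations and provided buffers need considerably newer kernels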
- Resource management - fixed buffers and registered files consume kernel resources that must be explicitly freed
The design feels illegal because it bypasses decades of Unix I/O conventions. No read(), no write(), no select(), no epoll() - just shared memory and atomic operations between userspace and kernel, operating at speeds that traditional interfaces cannot match.