Kernel Bypass Networking: DPDK, io_uring, and the RDMA Revolution

2025-02-10 · Leonardo Benicio

Summary

Dive into how modern systems escape the kernel networking stack for microsecond-scale performance: DPDK's poll-mode drivers, io_uring's submission rings, RDMA's one-sided operations, and the trade-offs each approach demands.

Contents

The Linux kernel’s networking stack is a monument of software engineering. It handles TCP congestion control, IP fragmentation, netfilter hooks, iptables rules, socket buffers (sk_buffs), NAPI polling, softirq processing, and dozens of other concerns that have accreted over three decades of Internet evolution. It is also — and this is the problem — far too slow for modern workloads. When your service-level objective demands sub-100-microsecond tail latency or your NIC is pushing 100 million packets per second, the kernel network stack becomes the bottleneck, and the only way forward is to get the kernel out of the way.

Kernel bypass is not a single technique but a family of approaches united by a common philosophy: move the data path into userspace, eliminate system calls, and let the application control its own destiny. This post covers the three dominant paradigms — DPDK’s poll-mode drivers for raw packet processing, io_uring’s submission/completion rings for efficient asynchronous I/O, and RDMA’s one-sided operations for remote memory access — with enough technical depth that you can understand not just how they work but when to use each and what hidden costs they impose.

1. The Problem: Why the Kernel Stack Is Slow

Let us quantify the problem before discussing solutions. A single system call — say, sendmsg() on a connected TCP socket — traverses a surprising amount of kernel code. A simplified trace:

userspace:  sendmsg(fd, buf, len)  →  syscall (trap to kernel)
kernel:     __sys_sendmsg()
              → sock_sendmsg()
                → inet_sendmsg()
                  → tcp_sendmsg()
                    → tcp_sendmsg_locked()
                      → tcp_push()
                        → __tcp_push_pending_frames()
                          → tcp_write_xmit()
                            → __tcp_transmit_skb()
                              → ip_queue_xmit()
                                → __ip_local_out()
                                  → dst_output()
                                    → dev_queue_xmit()
                                      → __dev_xmit_skb()
                                        → sch_handle_egress (qdisc)
                                          → dev_hard_start_xmit()
                                            → ndo_start_xmit()  [driver]

Each function call is cheap, but the aggregate is not: hundreds of CPU cycles per packet for the software path alone, before factoring in cache misses, TLB flushes, context switches, interrupt handling, and the overhead of copy_from_user / copy_to_user for crossing the kernel-userspace boundary. On a modern CPU running at 3 GHz, 500 cycles is 167 nanoseconds — tolerable for a few thousand packets per second, devastating at 10 million.

1.1 The 100-Microsecond Budget

In a distributed system, a single user-facing request might fan out to dozens of internal RPCs. If each RPC takes 100 microseconds of network time, and 20 of them are on the critical path, you are at 2 milliseconds before the application has done any work. The kernel’s networking overhead — context switches, interrupt processing, buffer copies — can easily consume 10-30 microseconds per RPC. When your budget is 100 microseconds, a 30-microsecond tax is existential.

This is why kernel bypass matters: it is not about shaving nanoseconds off a rare code path. It is about reclaiming the 20-40% of your latency budget that the kernel consumes, enabling systems that were previously infeasible.

1.2 The Four Sources of Kernel Overhead

System call overhead: The syscall/sysret instruction pair costs ~100-200 cycles on modern x86, plus the Spectre/Meltdown mitigations (KPTI, retpolines) that add another ~50-200 cycles. Multiply by thousands of I/O operations per second and this dominates.
Data copying: The kernel copies data from userspace buffers to kernel buffers (sk_buffs) and back. For large messages, this copy saturates memory bandwidth; for small messages, the copy setup overhead dominates.
Interrupt-driven processing: Traditional NICs interrupt the CPU on packet arrival. Each interrupt triggers a context switch (if the interrupted thread was userspace), TLB flush, and cache pollution. NAPI mitigates this by switching to polling under high load, but the transition is a heuristic, not a guarantee.
Lock contention: The kernel’s data structures (socket hash tables, routing tables, netfilter chains) are protected by locks that become contended under high connection rates. Even with RCU and per-CPU data structures, the overhead is measurable at scale.

2. DPDK: Userspace Packet Processing

The Data Plane Development Kit (DPDK) is the most radical form of kernel bypass: it moves the entire packet processing pipeline — from the NIC to the application — into userspace. DPDK applications run in a tight polling loop, constantly checking NIC receive rings for new packets, processing them, and placing responses on transmit rings. There are no interrupts, no system calls, and no kernel involvement in the data path whatsoever.

2.1 Architecture: UIO, VFIO, and Huge Pages

DPDK relies on three kernel facilities that provide controlled hardware access to userspace:

UIO (Userspace I/O) / VFIO (Virtual Function I/O): These kernel subsystems map PCI device BARs (Base Address Registers) into userspace, allowing DPDK to directly read and write NIC control registers. VFIO is the modern choice — it provides IOMMU protection, preventing a buggy DPDK application from corrupting physical memory via DMA misconfiguration.
Huge pages: DPDK allocates packet buffers from huge pages (2MB or 1GB) rather than standard 4KB pages. This reduces TLB misses — a single 2MB huge page covers 512 standard pages, and the TLB is limited (typically ~64-256 entries for 4KB pages). For a DPDK application processing millions of packets per second, TLB miss overhead is significant.
RSS (Receive Side Scaling): DPDK configures the NIC to hash incoming packets across multiple receive queues, each pinned to a dedicated CPU core. This eliminates lock contention: each core processes its own queue independently.

The result is a zero-copy, zero-syscall, zero-interrupt data path. Packets arrive in NIC DMA buffers mapped into userspace; DPDK’s poll-mode driver detects them by checking a doorbell register; the application processes the packet in place and either forwards it or drops it — all without the kernel being involved.

2.2 Poll-Mode Drivers (PMDs)

A DPDK PMD is a userspace driver that directly controls a specific NIC model. Intel provides PMDs for its ixgbe (10G), i40e (40G), and ice (100G) controllers; Mellanox provides PMDs for ConnectX series adapters; and there are community PMDs for various other NICs.

A typical PMD receive loop:

while (running) {
    uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, mbufs, BURST_SIZE);
    for (int i = 0; i < nb_rx; i++) {
        struct rte_mbuf *pkt = mbufs[i];
        // Process packet: parse headers, lookup flow table, transform...
        process_packet(pkt);
        // Forward, drop, or enqueue for TX
        rte_eth_tx_burst(port_id, tx_queue_id, &pkt, 1);
    }
}

Key observations:

rte_eth_rx_burst() polls the NIC’s receive descriptor ring, retrieves up to BURST_SIZE packets, and returns immediately — even if zero packets arrive. The CPU spins, consuming 100% of a core regardless of load.
There is no flow control: if the application cannot keep up, packets are silently dropped when the receive ring fills. DPDK applications must be carefully sized to handle peak load.
The programming model is run-to-completion: each packet is fully processed on a single core. This simplifies state management (no context switches) but requires the processing function to be non-blocking.

2.3 The Cost: Lost Kernel Services

DPDK’s raw performance comes at a steep price: you lose the entire kernel networking stack. No TCP/IP — you must either implement your own (like mTCP, Seastar’s native stack, or F-Stack’s FreeBSD-derived stack) or link against a userspace TCP library. No netfilter/iptables — security filtering must be done in the application. No standard socket API — applications must be rewritten against DPDK’s rte_mbuf and rte_eth_* APIs. No kernel TCP congestion control — you are responsible for implementing Reno, CUBIC, or BBR yourself.

This is why DPDK is primarily used in infrastructure software — software routers (VPP/fd.io), load balancers (Katran, Maglev), virtual switches (OVS-DPDK), and telco gateways — where the performance gain justifies the development cost. General-purpose applications rarely adopt DPDK directly; instead, they use higher-level frameworks that build on it.

2.4 lcores and Run-to-Completion Scheduling

DPDK’s execution model is built around lcores — logical cores that are pinned to specific hardware threads. Each lcore runs an infinite loop: poll for work, process work, repeat. This is the same model as the kernel’s per-CPU kthreads, but moved to userspace and specialized for networking.

The run-to-completion model has important implications:

Cache locality: By processing a packet from arrival to departure on a single core, all data stays in L1/L2 cache. Context switching would evict this data.
No preemption: An lcore never yields voluntarily. If a packet takes too long to process, subsequent packets are delayed or dropped. This puts a hard upper bound on per-packet processing time.
Pipeline model alternative: Some DPDK applications split processing into stages (receive → classify → transform → transmit) and pass packets between lcores via software rings. This pipelining increases throughput at the cost of inter-core communication latency.

2.5 DPDK Performance Numbers

To ground the discussion, typical DPDK performance on a modern x86 server with a 100GbE NIC:

Throughput: 100-148 million packets per second (Mpps) for minimum-size (64-byte) packets, saturating a 100GbE link. The kernel stack typically maxes out at 1-5 Mpps per core.
Latency: 10-20 microseconds for a simple L3 forward (NIC → DPDK → NIC), versus 30-100 microseconds through the kernel.
CPU efficiency: ~100-200 CPU cycles per packet in DPDK versus 2,000-10,000 cycles through the kernel stack.

These numbers explain why every high-frequency trading firm, CDN edge node, and telco 5G infrastructure uses DPDK. When a single microsecond of latency advantage translates to millions in revenue — as it does in electronic market making — raw DPDK performance is not optional; it is existential.

3. io_uring: Submission Rings and the Kernel’s Counterattack

DPDK bypasses the kernel entirely. io_uring, introduced by Jens Axboe in Linux 5.1, takes the opposite approach: it makes kernel I/O so efficient that bypass is unnecessary for many workloads. io_uring is a general-purpose asynchronous I/O interface that works for files, sockets, and any other file descriptor. It achieves near-DPDK-level efficiency for network I/O while retaining the kernel’s TCP stack, filesystems, and security model.

3.1 The SQ/CQ Ring Architecture

io_uring operates through two lock-free ring buffers shared between userspace and the kernel:

Userspace                  Kernel
┌─────────────┐           ┌──────────────┐
│  Submission  │  ──────►  │  Kernel      │
│  Queue (SQ)  │   SQEs    │  Processing  │
│              │           │              │
│  Completion  │  ◄──────  │  (async      │
│  Queue (CQ)  │   CQEs    │   workers)   │
└─────────────┘           └──────────────┘

Submission Queue (SQ): A ring buffer where userspace writes Submission Queue Entries (SQEs) — descriptions of I/O operations to perform. An SQE specifies the operation code (IORING_OP_READ, IORING_OP_WRITE, IORING_OP_ACCEPT, IORING_OP_SENDMSG, etc.), the file descriptor, buffer address, offset, and length.
Completion Queue (CQ): A ring buffer where the kernel writes Completion Queue Entries (CQEs) — results of completed operations. A CQE contains the operation’s return code, the userspace-supplied user data (for correlation), and any flags.

The magic is that both rings are mapped into userspace with read-write access to both sides. Userspace can submit operations by writing SQEs into the SQ and updating a tail pointer — no system call needed. The kernel polls the SQ head pointer, processes SQEs in batches, and writes CQEs into the CQ. Userspace polls the CQ for completions, again without system calls.

3.2 Submission Batching and the SQ Polling Mode

The primary overhead of traditional asynchronous I/O (POSIX AIO, libaio) is the system call per submission and per completion. io_uring eliminates both:

Submission batching: Userspace appends multiple SQEs and then performs a single io_uring_enter() system call (or, in SQPOLL mode, zero system calls) to submit them all at once.
SQ Polling (SQPOLL): A kernel thread spins, continuously polling the SQ for new entries. Userspace writes SQEs and updates the tail pointer — the kernel thread picks them up with no userspace transition. This is the closest analogue to DPDK’s poll mode, but inside the kernel.
IORING_SETUP_SQPOLL: When the io_uring is created with this flag, the kernel dedicates a thread to polling the SQ. The trade-off: one CPU core spins at 100% (like DPDK), but the submission path has near-zero latency.

For network workloads using IORING_OP_SENDMSG and IORING_OP_RECVMSG with SQPOLL, io_uring can achieve sub-10-microsecond round-trip latencies for small messages — within a factor of 2-3 of DPDK, but with all the benefits of the kernel TCP stack.

3.2.1 io_uring Zero-Copy Send

Linux 5.20 (later renamed to 6.0) introduced IORING_OP_SEND_ZC, a zero-copy send operation that eliminates the userspace-to-kernel copy for outgoing data. With zero-copy send, the application registers a buffer region with the io_uring, and the kernel maps those pages directly into the socket’s send path. When the NIC supports scatter-gather DMA and TCP segmentation offload (TSO), the data flows from userspace pages to the wire without an intermediate copy.

The semantics are nuanced: the buffer cannot be reused until the kernel signals completion via a CQE, which occurs only after the NIC has finished DMAing the data. Applications must manage buffer lifetimes carefully — a premature overwrite corrupts the in-flight transmission. The io_uring provides notification via the IOSQE_CQE_SKIP_SUCCESS flag, which suppresses the CQE for successful completions, allowing the application to batch-free buffers when the completion ring indicates the associated tag.

3.2.2 Multi-Shot Operations and Ring Mapped Buffers

Another innovation in io_uring is multi-shot operations. A single SQE can request multiple completions — for instance, IORING_OP_MULTISHOT_ACCEPT accepts connections in a loop without requiring a new SQE for each one, with each accepted connection generating a separate CQE. This is remarkably efficient for high-connection-rate servers.

Combined with ring-mapped buffers (IORING_SETUP_BUF_RING), where the kernel directly maps a ring of pre-allocated buffers and picks the next free one for each receive operation, io_uring eliminates buffer management overhead entirely. The application provides a pool of buffers; the kernel consumes them from the ring and returns them via CQEs. This is the closest the kernel has come to DPDK’s buffer management model, and it enables sustained throughput of millions of messages per second per core.

3.3 Fixed Files and Fixed Buffers

Two additional optimizations push io_uring performance even higher:

Fixed files (IORING_SETUP_ATTACH_WQ): Normally, each SQE refers to a file descriptor by number, and the kernel must look up the struct file from the process’s file table — an atomic operation with cache implications. With fixed files, the io_uring pre-registers an array of file descriptors at setup time, and SQEs reference them by index. The lookup becomes a simple array access.
Fixed buffers (IORING_SETUP_BUFFER_SELECT): Normally, reads and writes require the kernel to pin userspace memory pages. With fixed buffers, userspace pre-registers a pool of buffers; SQEs reference buffers by ID, and the kernel can DMA directly into them without per-operation pinning.

These optimizations eliminate the remaining per-operation overhead, bringing io_uring performance within striking distance of DPDK for many workloads.

3.4 Linked Operations and Async Chains

io_uring supports linked SQEs: a chain of operations where each SQE starts only after the previous one completes. The IOSQE_IO_LINK flag creates a dependency chain. If any operation in the chain fails, subsequent operations are cancelled — providing transactional semantics across multiple I/O steps.

This is a surprisingly powerful abstraction. Consider a proxy server that reads from an upstream socket and writes to a downstream socket:

// SQE 1: recv from upstream
sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe, upstream_fd, buf, BUF_SIZE, 0);
sqe->flags |= IOSQE_IO_LINK;

// SQE 2: send to downstream (executes only after SQE 1 completes successfully)
sqe = io_uring_get_sqe(&ring);
io_uring_prep_send(sqe, downstream_fd, buf, 0, 0);
sqe->user_data = SOME_COOKIE;

Before io_uring, implementing this without blocking required a state machine, callbacks, or an event loop with explicit buffer management. With linked SQEs, the kernel chains the operations, and the application just waits for the final CQE.

3.5 io_uring vs. DPDK: When to Use Which

The choice between io_uring and DPDK is not about raw speed — DPDK wins that by a factor of 2-5x. It is about the development model:

Use DPDK when you need to process millions of packets per second per core, you are willing to write your own TCP stack or use a userspace TCP library, and you can dedicate entire CPU cores to polling. This is the domain of infrastructure software: load balancers, firewalls, virtual switches, CDN edge nodes.
Use io_uring when you need microsecond-scale I/O latency but want to keep the kernel’s TCP stack, filesystem, and security model, and you cannot dedicate full cores to spinning. This is the domain of databases, storage systems, proxies, and application servers.

The beauty of io_uring is that it narrows the gap. A well-tuned io_uring-based server can achieve 80-90% of DPDK’s throughput with 10% of the development complexity. For most applications, that is the right trade-off.

4. RDMA: Remote Direct Memory Access

RDMA represents a third paradigm: bypass not just the kernel but also the remote CPU. With RDMA, one machine’s NIC can read from or write to another machine’s memory directly, without involving the remote CPU at all. This is not just a performance optimization — it fundamentally changes the programming model for distributed systems.

4.1 The RDMA Primitives

RDMA provides two classes of operations:

Two-sided (send/recv): These resemble traditional messaging. The sender posts a SEND work request specifying a local buffer; the receiver must have pre-posted a RECV work request with a buffer. When the SEND completes, the data has been written into the receiver’s buffer.

One-sided (read/write/atomic): These are the revolutionary operations. The initiator specifies a remote virtual address and a local buffer; the remote NIC performs the memory access without the remote CPU’s knowledge. One-sided READ pulls data from remote memory to local memory; one-sided WRITE pushes data from local memory to remote memory; atomic operations (compare-and-swap, fetch-and-add) perform atomic read-modify-write on remote memory.

The one-sided operations are what make RDMA so powerful — and so different from all previous networking technologies.

4.2 Queue Pairs and Completion Queues

RDMA communication is structured around Queue Pairs (QPs). Each QP has:

Send Queue (SQ): Work requests for SEND, RDMA WRITE, RDMA READ, and atomic operations.
Receive Queue (RQ): Work requests for RECV operations (to receive incoming SENDs).

Both queues are in userspace memory, mapped directly by the RDMA NIC (RNIC). To initiate an operation:

The application constructs a Work Queue Entry (WQE) in the SQ, specifying the operation type, local buffer, remote address (for one-sided ops), and a Work Request ID (WRID) for correlation.
The application writes to a doorbell register on the RNIC — a memory-mapped I/O write that notifies the NIC of new work.
The RNIC DMA-reads the WQE from the SQ, executes the operation (DMA transfers over the network), and writes a Completion Queue Entry (CQE) into the Completion Queue (CQ).
The application polls the CQ for completions, matching CQEs to requests via the WRID.

The entire path — from WQE posting to CQE completion — involves zero system calls and zero CPU involvement on the remote side (for one-sided operations). This is the hardware equivalent of DPDK: the RNIC is the poll-mode driver, and the application is fully asynchronous.

4.3 Memory Registration: The Hidden Cost

RDMA’s Achilles’ heel is Memory Registration (MR). Before the RNIC can access a region of userspace memory, that region must be registered — the kernel pins the pages (preventing them from being swapped out), maps the virtual-to-physical translations into the RNIC’s IOMMU, and returns a memory key (lkey/rkey) that the application includes in work requests.

Memory registration is expensive — hundreds of microseconds for large regions — and historically required a system call per registration. This tension between “register everything to avoid per-operation overhead” and “register on demand to avoid memory pinning” has shaped RDMA system design for two decades.

Modern solutions:

On-Demand Paging (ODP): The RNIC can handle page faults, requesting that the kernel pin a page only when it is actually accessed. This eliminates pre-registration but adds latency on first access.
Memory Windows: Sub-regions of a larger MR that can be rebound without re-registration.
FRWR (Fast Registration Work Requests): A work request that registers memory asynchronously from the SQ, avoiding the system call but still requiring a round-trip through the NIC.
Implicit ODP: The RNIC assumes all process memory is accessible and uses the IOMMU to translate addresses dynamically, with page faults handled in hardware (Intel’s I/O Memory Management Unit) or firmware.

4.4 RDMA Transport: RC, UD, and Dynamically Connected

RDMA supports multiple transport modes, each with different trade-offs:

Reliable Connection (RC): The classic RDMA transport. A QP is connected to exactly one remote QP. Messages are delivered reliably and in order. This requires per-connection state on the RNIC, limiting scalability — a single RNIC might support 10,000-100,000 RC QPs before running out of internal SRAM.
Unreliable Datagram (UD): A QP can communicate with any remote QP. Messages are unreliable (may be dropped) and unordered. Scales to millions of QPs but requires the application to handle retransmission and ordering.
Dynamically Connected (DC): A hybrid: QPs can change their remote peer dynamically, achieving RC reliability with UD-level scalability at the cost of connection setup latency.

The choice of transport profoundly affects distributed system design. A key-value store using RC QPs must manage connection pools carefully; a key-value store using UD QPs must implement retry logic akin to TCP but can scale to millions of clients per server.

4.5 RDMA in Practice: FaRM, HERD, and the Disaggregated Data Center

RDMA has enabled a new class of systems:

FaRM (Microsoft Research): A distributed computing platform that uses RDMA for both messaging and memory access, achieving 10-20 microsecond RPC latencies across a cluster. FaRM’s transaction protocol uses RDMA writes for log shipping and RDMA reads for object access, entirely bypassing the remote CPU for the data path.
HERD (Microsoft Research): A key-value store that achieves 26 million RPCs per second on a single machine using RDMA UD with one-sided READs. The secret is eliminating the remote CPU entirely: all reads are served by the RNIC directly from registered memory, and the CPU only handles writes.
Disaggregated memory (various): The idea that memory and CPU can be separate resources connected via RDMA. Systems like LegoOS, INFINISWAP, and AIFM use RDMA to access remote memory as if it were local, with page-fault-driven migration.

The limiting factor in RDMA systems is often not the network but the PCIe bus: a single 100GbE RNIC saturates a PCIe 3.0 x16 link (~128 Gbps). Modern RNICs (ConnectX-6, ConnectX-7) use PCIe 4.0 and 5.0 to push 200-400 Gbps, and the next generation will demand PCIe 6.0 — a reminder that kernel bypass shifts the bottleneck from software to hardware, but does not eliminate it.

4.6 Safety, Isolation, and the Case Against RDMA

RDMA’s one-sided operations create serious safety concerns. When a machine can write directly to another machine’s memory:

Memory corruption: A buggy client that writes to the wrong address corrupts the server’s memory. Traditional network programming isolates failures in the network layer; RDMA blurs the boundary.
Denial of service: A misbehaving client can flood the RNIC with RDMA requests, exhausting PCIe bandwidth and starving other clients.
Security: The RNIC’s memory translation tables (Memory Keys, Protection Domains) are complex and have been a source of vulnerabilities in RDMA deployments.

These concerns have limited RDMA adoption to controlled environments — HPC clusters, managed data centers, and cloud providers’ internal networks. RDMA over Converged Ethernet (RoCE) extends RDMA to standard Ethernet fabrics, but the safety concerns remain. The industry is still searching for the right isolation primitives to make RDMA safe for multi-tenant cloud deployments.

5. Kernel Bypass and the CPU Side: Cache Locality and NUMA

An often-overlooked dimension of kernel bypass is its interaction with CPU cache hierarchy and NUMA topology. The traditional kernel stack’s per-packet memory allocations (sk_buffs, socket buffers) are spread across memory, causing cache misses. Kernel bypass techniques give applications control over memory placement, which can be as important for performance as avoiding system calls.

5.1 Memory Pool Design

DPDK’s rte_mempool and io_uring’s buffer rings both pre-allocate memory from specific NUMA nodes, ensuring that packet buffers reside in memory local to the processing core. A DPDK application on a dual-socket server typically creates separate mempools for each socket, and lcores only allocate from their local pool. This eliminates the remote NUMA access penalty — 50-100 additional nanoseconds per access, which at 100 Mpps translates to 5-10 microseconds of pure memory latency per second of processing.

RDMA goes further: the RNIC itself is a NUMA-affine PCIe device. If the RNIC is on socket 0’s PCIe root complex, RDMA operations that target memory on socket 1 incur a cross-socket DMA penalty. High-performance RDMA applications pin memory on the same NUMA node as the RNIC and use libnuma to enforce affinity.

5.2 Cache Line Bouncing and False Sharing

In a traditional socket-based server, the kernel’s struct sock and associated struct sk_buff are shared between the application thread (via system calls) and the kernel’s softirq processing. This causes cache line bouncing — the same cache line ping-pongs between cores, each invalidation costing tens of cycles. Kernel bypass eliminates this sharing: in DPDK, the poll-mode driver and the application logic run on the same core; in io_uring with SQPOLL, the kernel thread and the application are on different cores but share only the well-defined SQ/CQ ring boundaries, minimizing cache-line contention.

6. Comparing the Three Paradigms

Let us compare DPDK, io_uring, and RDMA across the dimensions that matter for system design:

                     DPDK          io_uring (SQPOLL)    RDMA (RC)
─────────────────────────────────────────────────────────────────
Data path            Userspace     Kernel               Hardware (NIC)
Syscall per I/O      0             0 (batched)           0
CPU overhead/pkt     ~100-200 cyc  ~300-500 cyc         ~50-100 cyc (hw offload)
Remote CPU           0 (local)     0 (local)             0 (one-sided)
TCP stack            Userspace     Kernel               N/A (custom)
Max throughput/core  100 Mpps      5-10 Mpps             200 Mpps (hw limited)
Latency (small msg)  2-5 µs        8-15 µs               1-3 µs
Memory registration  No            No                    Yes (MRs)
Connection model     N/A (raw)     FD-based              QP-based
Security model       Custom        Kernel (seccomp,etc)  Protection Domains
Development effort   Very high     Medium                High

The table makes clear: each technology trades development complexity for performance. DPDK is the fastest software path but requires building your own network stack. io_uring is the pragmatic middle ground — 5-10x faster than traditional sockets with a fraction of DPDK’s complexity. RDMA is the ultimate in performance but demands hardware support, careful memory management, and acceptance of the safety trade-offs.

7. When Not to Bypass the Kernel

Kernel bypass is not a universal good. There are scenarios where the traditional kernel stack is the right choice:

Low connection rates, high message sizes: If your application handles hundreds of connections, not millions, and each message is kilobytes to megabytes, the kernel’s TCP stack is perfectly adequate. A single sendmsg() of 64KB amortizes the system call overhead over the data transfer time.
Complex protocol requirements: If you need TLS, HTTP/2, WebSocket, or other protocol layers, the kernel’s kTLS (kernel TLS offload) or userspace libraries built on sockets (OpenSSL, BoringSSL) are mature and battle-tested. Reimplementing TLS on top of DPDK or RDMA is a multi-engineer-year effort.
Kernel features you cannot replicate: TCP congestion control (CUBIC, BBR), netfilter/iptables, conntrack, cgroups I/O accounting, and eBPF-based observability all live in the kernel. Bypassing the kernel means losing these — or reimplementing them.
Deployment constraints: DPDK requires dedicated CPU cores, 1GB huge pages, and VFIO device assignment, none of which are available in typical container environments (though Kubernetes with the Multus CNI and SR-IOV can provide them). io_uring is increasingly available — it works in containers, with cgroups, and without special privileges — but SQPOLL mode requires CAP_SYS_NICE.
Diminishing returns: If your service’s bottleneck is the backend database, not the network, kernel bypass on the frontend is premature optimization. Profile first; bypass second.

8. The Future: SmartNICs, eBPF, and the Evolving Boundary

The kernel bypass landscape is evolving rapidly. Three trends are reshaping the space:

SmartNICs (DPUs/IPUs): NVIDIA’s BlueField, Intel’s IPU, and Marvell’s Octeon put ARM cores on the NIC itself, capable of running DPDK, OVS, or even a lightweight Linux kernel. This moves the bypass from the host CPU to the NIC, freeing host cores for application work.
eBPF and XDP (Express Data Path): eBPF allows safe, kernel-verified programs to run in the kernel’s network driver at the earliest point (XDP hook, before sk_buff allocation). XDP programs can drop, redirect, or forward packets without entering the full stack. This is a form of in-kernel bypass — not as fast as DPDK, but retaining the kernel’s safety guarantees.
io_uring for everything: Jens Axboe and the Linux community continue expanding io_uring’s scope. Recent additions include IORING_OP_URING_CMD (passthrough commands to device drivers), zero-copy send (IORING_OP_SEND_ZC), and multi-shot operations. The vision is for io_uring to become the universal userspace I/O interface, making raw DPDK necessary only for the most extreme workloads.

9. Summary

Kernel bypass is a response to a fundamental mismatch: the kernel’s networking stack, optimized for generality and decades of feature accumulation, cannot serve the latency and throughput demands of modern distributed systems. DPDK, io_uring, and RDMA represent three points on a spectrum of trade-offs between performance and development complexity.

DPDK gives you raw hardware access at the cost of rebuilding the network stack. io_uring gives you a 10x improvement over traditional sockets while keeping the kernel’s mature TCP implementation. RDMA rewrites the rules with one-sided remote memory access but demands careful hardware and security engineering.

The unifying principle across all three is the elimination of unnecessary intermediaries. Whether it is the kernel (DPDK), the system call interface (io_uring), or the remote CPU (RDMA), the message is the same: in a world where a microsecond matters, every layer between your code and the wire is a candidate for removal. The art of systems programming is knowing which layers you can afford to lose — and which you cannot.