The Performance Of Log Replication In Raft Under High Throughput: Batching, Pipelining, And Pipelined Raft

2023-04-14 · Leonardo Benicio

Summary

A comprehensive technical exploration of the performance of log replication in raft under high throughput: batching, pipelining, and pipelined raft, covering key concepts, practical implementations, and real-world applications.

Contents

Raft Log Replication Under High Throughput: Batching, Pipelining, and the Quest for Performance

Imagine you’re architecting a distributed database that must ingest tens of thousands of client writes every second. Each write must be durably stored across multiple servers to survive failures. The core of your system is a consensus algorithm: Raft, the understandable and increasingly ubiquitous protocol that powers etcd, Consul, TiKV, and countless other production systems. But as you push the throughput higher, you hit a wall. The leader—the single node that orchestrates log replication—becomes a bottleneck. Every log entry must be sent to followers, acknowledged by a majority, and only then committed. Under a naive implementation, this serial chain of round-trips caps your throughput at the inverse of the replication latency. Your cluster can only do a few hundred writes per second. To break through, you need to understand and exploit the performance levers hidden inside Raft’s log replication.

This blog post dives deep into the performance of log replication in Raft under high throughput. We’ll explore three complementary techniques—batching, pipelining, and a more advanced evolution I’ll call Pipelined Raft. Batching groups multiple log entries into a single network message, amortizing overhead. Pipelining allows the leader to send multiple batches before waiting for acknowledgments, overlapping network latency with processing. Pipelined Raft goes further, adapting the concurrency of replication to prevailing load and network conditions. By the end, you’ll have a clear mental model of how these optimizations interact, where the tradeoffs lie, and how to apply them in your own distributed systems.

But first, let’s reacquaint ourselves with the problem. Raft’s central function is to maintain a consistent, replicated log. The leader receives client requests, appends them to its own log, and then sends an AppendEntries RPC to each follower. A follower, upon receiving the RPC, persists the entries and sends back a success response. Once the leader receives a successful acknowledgment from a majority of the cluster (including itself), it considers the entry committed, applies it to its state machine, and responds to the client. Under a naive serial implementation, the leader cannot issue the next AppendEntries to a follower until it receives the response for the previous one. This sequential constraint leads directly to the throughput ceiling we described.

The rest of this article will unpack this problem in painstaking detail, examine the three optimization techniques, and provide concrete guidance for building high-performance Raft-based systems. We’ll use a mix of theory, analytical models, pseudo‑code, and real‑world examples to ground the discussion.

1. Raft Log Replication: A Deeper Look

Before diving into performance optimizations, we need a thorough understanding of Raft’s log replication mechanics. Raft ensures strong consistency by imposing a strict ordering on log entries. The log is an ordered sequence of commands, each uniquely identified by a monotonically increasing term and index. The leader is responsible for accepting client requests, creating new entries, and replicating them to followers.

1.1 The AppendEntries RPC

The core of replication is the AppendEntries RPC. The leader sends it to each follower and includes:

The leader’s term.
The index and term of the entry immediately preceding the new ones (to ensure consistency).
A list of one or more log entries to be appended.
The leader’s commit index (a hint for followers to know which entries are safe to apply).

When a follower receives an AppendEntries call, it performs the following safety checks:

It verifies that the leader’s term is at least as large as its own current term.
It checks that its own log contains an entry at the prevLogIndex with a term matching prevLogTerm. If not, it rejects the RPC.

If the checks pass, the follower appends any new entries (conflicts are resolved by deleting divergent entries) and sends a success response. The leader then updates its matchIndex for that follower and can compute whether a majority have replicated the entry.

1.2 Commit Semantics

The leader maintains a commitIndex that is the highest index known to be stored on a majority of servers. When the leader learns that a majority of followers have stored an entry (via matchIndex), it advances commitIndex and applies the entry to its state machine. Followers apply entries lazily when they learn of an updated commitIndex from subsequent AppendEntries calls.

1.3 Leader as the Serialization Point

All writes flow through the leader. This design simplifies correctness but introduces an inherent serial bottleneck. In a cluster of five nodes, the leader must replicate each entry to at least two followers (plus itself) before it can commit. If the network round-trip time (RTT) between the leader and followers is 1 ms, a naive serial implementation can achieve at most 1000 writes per second. In practice, additional overhead such as disk I/O, CPU processing, and serialization/deserialization makes it even worse.

2. The Bottleneck: Why Naive Replication Is Slow

To understand the performance gap, we must model the latency of a single write. Under the most basic implementation, the timeline for a single entry looks like this:

Client sends request to leader (network overhead).
Leader appends entry to its log (possibly with a disk fsync).
Leader sends AppendEntries to each follower (serialization + network send).
Network propagation to followers (RTT).
Follower receives, persists, and replies (disk fsync + network send back).
Leader receives reply, updates majority count, commits.
Leader applies entry and responds to client.

Steps 3–6 constitute the replication round-trip. In a naive implementation, the leader cannot start the next round until it receives the reply from the previous one. Therefore, the maximum throughput (entries per second) is roughly:

[ \text{Throughput} \approx \frac{1}{\text{round-trip time per entry}} ]

If RTT is 1 ms, throughput is 1000 entries/s. If we add disk latency (e.g., 0.5 ms for a non-volatile memory write), throughput drops further.

2.1 Overhead Beyond Pure Latency

Beyond the round-trip, each AppendEntries RPC incurs fixed overheads:

Serialization: Marshaling the log entries into a protocol buffer or similar format.
Network headers: TCP/IP overhead, TLS if used.
Kernel transitions: System calls for send and recv.
Context switching: In a multi‑threaded implementation.

These overheads are largely independent of the number of entries in the message. If we send only one entry per RPC, the overhead per entry is high. If we batch multiple entries, the overhead is amortized.

2.2 The Impact of Disk I/O

Raft requires that entries be durably stored on both leader and followers before they can be committed. Disk fsync is notorious for being slow, especially on spinning disks or under high write amplification. In many production systems, the disk I/O latency dominates the replication latency. For example, if a leader and its followers run on cloud instances with 1 TB magnetic disks, a synchronous fsync might take 5–10 ms. That would limit naive throughput to 100–200 entries/s.

2.3 Why We Need to Optimize

The naive approach works fine for low throughput (e.g., leader election or configuration changes) but fails for write‑intensive workloads. Modern applications demand tens of thousands of writes per second. Optimizing log replication is therefore not a luxury—it’s a necessity.

3. Technique 1: Batching

Batching is the simplest and most effective optimization. Instead of sending one log entry per AppendEntries RPC, the leader accumulates several entries and sends them as a single message. This reduces the number of network round‑trips, amortizes serialization overhead, and can improve disk write efficiency (e.g., by writing a larger sequential block).

3.1 How Batching Works

The leader maintains a pending buffer of entries that have not yet been sent to followers. When the buffer reaches a certain size (count‑based or byte‑based) or when a timer expires, the leader flushes the batch to all followers via a single AppendEntries RPC per follower.

Pseudo‑code for the leader’s main loop:

func (l *Leader) HandleClientRequests(entries []LogEntry) {
    l.pending = append(l.pending, entries...)
    // Trigger flush if sufficient entries or enough time passed
    if len(l.pending) >= l.batchSize || time.Since(l.lastFlush) > l.batchTimeout {
        l.flushToFollowers()
    }
}

func (l *Leader) flushToFollowers() {
    batch := l.pending
    l.pending = nil
    l.lastFlush = time.Now()
    for _, follower := range l.followers {
        go func(f *Follower) {
            req := &AppendEntriesRequest{
                Term:         l.currentTerm,
                PrevLogIndex: batch[0].Index - 1,
                PrevLogTerm:  l.log[batch[0].Index-1].Term,
                Entries:      batch,
                LeaderCommit: l.commitIndex,
            }
            resp := f.AppendEntries(req)
            // handle response...
        }(follower)
    }
}

After the flusher sends the RPC, the leader waits for responses from a majority of followers. However, with batching, the leader does not wait for each individual entry—it waits for the entire batch to be acknowledged.

3.2 Throughput Gains

If we batch B entries per RPC, the effective throughput becomes:

[ \text{Throughput} \approx \frac{B}{\text{round-trip time per batch}} ]

Assuming the batch RTT is roughly the same as a single‑entry RTT (since the network latency dominates and the batch size is moderate), throughput scales linearly with B. For example, if B=10 and RTT=1 ms, we get 10,000 entries/s.

3.3 Choosing the Batch Size

Selecting the ideal B is a trade‑off:

Small batch: Low latency per entry (once the first entry is batched, it waits for the timer or for enough entries, adding latency). Throughput improvement is modest.
Large batch: High throughput but increased latency for individual entries that arrive early in the batch. The first entry in a large batch might wait 10 ms to accumulate enough entries, significantly increasing its end‑to‑end latency.

Latency considerations:

For a single client request arriving at time t0, the leader places it in the pending buffer. The buffer may not be flushed until either the batch size is reached or a timeout fires. If the batch size is 100 and the arrival rate is 1000 entries/s, the average wait time is about 50 ms (half the accumulation time). The timeout introduces a worst‑case latency.

Adaptive batching: Some implementations adjust the batch size dynamically based on current load. For example, under light load, the timeout triggers quickly, giving low latency. Under high load, the batch size grows, increasing throughput.

3.4 Real‑World Batching in Raft Implementations

etcd (v3) uses batching: it aggregates multiple proposals into a single AppendEntries RPC, with a configurable batch size and a BatchInterval (default 100 ms). This gave etcd a dramatic improvement from a few hundred to tens of thousands of writes per second.
TiKV batches through a mechanism called “batch system” that groups AppendEntries messages to the same follower.
Hashicorp Consul also uses batching in its Raft backend, with a configurable RaftBatchWrite.

3.5 Code Example: Benchmarking Batch Impact

We can simulate the effect of batching with a simple analytical model in Python:

import numpy as np

def throughput_vs_batch(rtt_ms=1, overhead_ms=0.1, batch_sizes=range(1, 51)):
    for b in batch_sizes:
        # latency per batch = overhead (serialization+fsync) + rtt
        latency_per_batch = overhead_ms + rtt_ms
        throughput = b / (latency_per_batch / 1000)  # entries per second
        print(f"Batch size {b}: {throughput:.0f} entries/s")

This model shows that without batching (B=1), throughput is about 909 entries/s, while with B=10, throughput jumps to 9,090 entries/s—a ten‑fold increase.

3.6 Limitations of Batching Alone

Batching reduces the number of round‑trips but does not eliminate them entirely. The leader still must wait for the entire batch to be replicated before committing any entry in the batch. Moreover, if followers have variable latencies, the slowest follower becomes the bottleneck for the whole batch. Pipelining addresses this by allowing multiple batches to be in flight concurrently.

4. Technique 2: Pipelining

Pipelining breaks the serial dependency between batches. The leader sends a new batch to followers without waiting for the acknowledgement of the previous batch. This overlaps the network round‑trips and follower processing, effectively hiding the latency under high load.

4.1 How Pipelining Works

The leader maintains a window of in‑flight batches for each follower. After sending a batch, the leader increments a counter and immediately begins accumulating the next batch. It sends the next batch as soon as it is ready, up to a maximum number of outstanding batches (the pipeline depth). The follower still must process batches in order (due to Raft’s log consistency), but the leader does not wait.

Pseudo‑code with pipelining:

type FollowerConnection struct {
    inflight        int
    maxInflight     int
    nextIndex       int   // index of next entry to send
    requestCh       chan *AppendEntriesRequest
}

func (l *Leader) replicateToFollower(f *FollowerConnection) {
    for batch := range f.requestCh {
        // Wait if we already have maxInflight outstanding
        for f.inflight >= f.maxInflight {
            time.Sleep(10 * time.Microsecond) // busy wait or use condition variable
        }
        f.inflight++
        go func(b []LogEntry) {
            resp := f.rpc.AppendEntries(b)
            f.handleResponse(resp)
            f.inflight--
        }(batch)
    }
}

Note: In practice, the pipeline depth is often bounded to avoid overwhelming followers or causing excessive memory usage. A common default is 10 in many Raft implementations.

4.2 Throughput and Latency Trade‑Offs

Maximum throughput under pipelining is no longer bounded by the round‑trip time but by the processing capacity of the followers (and leader). If the follower can handle batches at rate μ batches per second, and the pipeline depth is P, then the leader can send at rate μ as long as the pipeline does not starve. The bottleneck shifts from network latency to processing capacity.

Impact on latency: For a single entry, the end‑to‑end latency may increase because the entry may sit in the leader’s buffer while previous batches are still being replicated. However, under high load, pipelining allows the system to saturate the disk and CPU, leading to higher throughput and, counter‑intuitively, often lower latency for a given throughput compared to a non‑pipelined blocked approach.

4.3 Interaction with Batching

Pipelining and batching are complementary. Batching reduces the number of network messages, while pipelining allows multiple messages to be in flight. Together, they provide a powerful combination. The leader batches entries, sends the batch, then immediately starts building the next batch. The follower receives batches out of order (but processes them sequentially) and acknowledges them independently.

4.4 Pipeline Stall and Flow Control

A problematic scenario is when the follower responds slowly to one batch, causing the pipeline to stall. For instance, if the follower’s disk is busy, it may take 10 ms to respond. During that time, the leader may have already filled the pipeline with new batches that cannot be processed until the earlier batch is acknowledged. This leads to a buildup of in‑flight batches and increased memory pressure.

To mitigate this, many Raft implementations use a sliding window with adaptive sizing. The leader monitors the round‑trip time and adjusts the maximum number of in‑flight batches dynamically. This is a precursor to the more advanced Pipelined Raft.

4.5 Real‑World Pipelining

Raft in CockroachDB: Uses pipelining with a configurable RaftMaxInflightMsgs. The default is 10.
Logcabin (the reference implementation): Implements pipelining with a fixed window.
TiKV: Uses a “Raft logical pipeline” that batches and pipelines concurrently.

5. Technique 3: Pipelined Raft (Adaptive Concurrency)

The straightforward pipelining described above treats all followers uniformly and uses a fixed pipeline depth. However, in real networks, follower performance varies due to heterogeneous hardware, network congestion, and disk pressure. A fixed pipeline depth either under‑utilizes fast followers (if depth is too small) or overwhelms slow followers (if depth is too large). Pipelined Raft is an evolution that dynamically adjusts the concurrency of replication to each follower based on measured behavior, similar to TCP congestion control.

5.1 Background and Motivation

The term “Pipelined Raft” was popularized by a 2018 paper from the authors of the Logcabin implementation, Diego Ongaro and John Ousterhout. They proposed extending Raft with a sliding window that adapts to both load and network conditions. The key insight is that the optimal number of in‑flight batches is the one that just barely keeps the follower busy without causing excessive queuing.

5.2 Core Mechanism

Pipelined Raft operates as follows:

The leader maintains for each follower a congestion control model, similar to TCP’s AIMD (Additive Increase Multiplicative Decrease).
Initially, the pipeline depth is 1. For each successful batch response, the depth is increased by 1 (additive increase).
If a batch times out (or if the follower responds with a failure), the depth is halved (multiplicative decrease).
Additionally, the leader measures the round‑trip time for each batch and uses it to compute a target RTT. If the RTT increases beyond a threshold, the leader assumes congestion and reduces the pipeline depth.

This adaptive behavior allows the system to quickly discover the capacity of each follower and avoid overwhelming slower nodes.

5.3 Implementation Considerations

Implementing adaptive concurrency in Raft requires careful integration with the existing Raft state machine. The pipeline depth is separate from the matchIndex and nextIndex tracking. The leader must still enforce log consistency: batches must be sent in order, but acknowledgments can arrive out of order.

Pseudo‑code for adaptive pipeline:

type AdaptiveFollower struct {
    conn        *FollowerConnection
    inflight    int
    maxInflight int
    rtt         time.Duration
    lastRTT     time.Duration
    pendingBatches map[int]*BatchInfo // index -> batch info
}

func (f *AdaptiveFollower) adjustWindow() {
    // Additive increase
    if f.inflight <= f.maxInflight/2 {
        f.maxInflight++
    }
    // Multiplicative decrease on timeout
    if !f.rttOk(f.lastRTT) {
        f.maxInflight = max(f.maxInflight/2, 1)
    }
}

func (f *AdaptiveFollower) onBatchAck(index int, resp *AppendEntriesResponse) {
    batch := f.pendingBatches[index]
    rtt := time.Since(batch.sendTime)
    f.lastRTT = rtt
    if resp.Success {
        f.adjustWindow()
    } else {
        // Failure: reduce window and possibly truncate log
        f.maxInflight = max(f.maxInflight/2, 1)
        // handle log inconsistency...
    }
    delete(f.pendingBatches, index)
    f.inflight--
}

5.4 Comparison with Pure Pipelining

Fixed pipeline: Simple but may perform poorly under varying conditions.
Adaptive pipeline: More complex but yields better resource utilization. It can saturate a fast follower without causing timeouts on a slow one.

The added complexity is modest: a few extra counters and a timer for RTT measurements. Many production systems have adopted this pattern, sometimes implicitly through rate‑limiting or back‑pressure mechanisms.

5.5 Pipelined Raft and Batching

Pipelined Raft can be combined with batching seamlessly. The leader first batches entries, then places the batch into the adaptive pipeline. The pipeline depth now refers to the number of batches (each containing multiple entries). The adaptive algorithm adjusts the batch concurrency, while batching adjusts the granularity of each pipeline stage.

5.6 Real‑World Adoption

TiKV implements a form of adaptive pipelining through the “RaftClient” which monitors latency and adjusts the send window dynamically.
etcd (as of v3.4) uses a configurable pipeline option that enables a fixed pipelining with back‑pressure. Newer versions have explored adaptive strategies.
RAFT paper (section 10.2.2 on performance) suggests that the original Raft prototype used a simple pipelining but noted that adaptive strategies are an area for future work.

6. Putting It All Together: A High‑Performance Raft Pipeline

Now that we have explored all three techniques, let us examine how a production‑grade Raft implementation combines them. The typical flow:

Client writes are accepted by the leader and placed into a pending buffer.
A batch builder collects entries until either a size or timeout threshold is reached, then forms a batch.
The batch is handed to the pipeline manager, which maintains a separate connection (or goroutine) per follower.
For each follower, the pipeline manager keeps a window of in‑flight batches. It sends the batch immediately if the window is not full; otherwise it blocks or enqueues.
The pipeline manager may use adaptive window sizing to adjust concurrency based on acknowledgments and RTT measurements.
Followers receive batches, apply them to their log (with fsync), and respond.
The leader collects acknowledgments. When a majority of followers have acknowledged a batch, its entries are committed. The leader then responds to the clients.

6.1 Latency Breakdown with Optimizations

Consider an entry arriving at the leader:

It waits in the pending buffer (up to batchTimeout or until batch is full).
The batch is sent. The leader does not block for this batch’s acknowledgment; it immediately starts the next batch.
The entry is committed when its batch is acknowledged by a majority, which may happen after several other batches have been sent and acknowledged.

The total latency is: buffer_time + pipeline_time where pipeline_time depends on the batch size and the rate of acknowledgments. Under high load, the pipeline is always full, and the latency is approximately batch_timeout + (pipeline_depth * batch_processing_time). With a batch timeout of 1 ms and a pipeline depth of 10, each batch might take 2 ms (including network and disk), so latency per entry ~ 1 + 20 = 21 ms. Without optimizations, latency per entry at that throughput might be much higher due to queuing.

6.2 Numerical Example

Assume we have three nodes, RTT = 1 ms, disk fsync = 1 ms, and overhead = 0.5 ms per batch. Without batching or pipelining, each entry takes: buffer(0) + serial(0.5) + network(1) + disk(1) + reply network(1) = 3.5 ms → 285 entries/s.

With batching (B=10) but no pipelining: latency per batch = 0.5 + 1 + 1 +1 = 3.5 ms, but throughput = 10 / 0.0035 ≈ 2857 entries/s. Latency per entry: up to batch accumulation (say 0.5 ms avg) + 3.5 ms = 4 ms.

With batching and pipelining (window=5): latency per entry still ~4 ms, but throughput can increase up to the processing rate of followers. If followers can process a batch every 2 ms (due to disk parallelism), throughput becomes 10 / 0.002 = 5000 entries/s. The pipeline depth ensures that the leader is never idle.

With adaptive pipeline, the window automatically grows to saturate the followers without causing excessive queuing.

7. Real‑World Implementations: Lessons and Benchmarks

Let’s look at how major systems implement these optimizations.

7.1 etcd

etcd, written in Go, uses the etcd/raft library. Its default configuration:

BatchSize: The number of entries per AppendEntries (default 64 MB or 1024 entries, whichever is smaller).
BatchInterval: 100 ms.
Pipeline (as of v3.2): when enabled, the leader sends the next batch without waiting. The number of in‑flight batches is unbounded, but a separate flow control mechanism (FlowControl) limits the number of entries pending for each follower.

Performance: With these settings, etcd can achieve up to 30,000 writes per second on a three‑node cluster using local SSDs and 10GbE network. Without pipelining, the same cluster maxes out around 5,000 writes/s.

7.2 TiKV

TiKV, the distributed key‑value store behind TiDB, uses Raft as its replication protocol. It employs a sophisticated pipeline:

The RaftClient sends RPCs concurrently to each peer, with a configurable concurrency (default 10).
It batches entries using a BatchSystem that groups proposals to the same peer into a single message.
It uses an adaptive window based on a round‑trip time monitor: if the RTT exceeds a threshold, the concurrency for that peer is reduced.

TiKV’s Raft pipeline is one of the most performant open‑source implementations, supporting up to 100,000 writes per second per raft group on optimized hardware.

7.3 Consul

Consul’s Raft implementation (Hashicorp’s raft library) uses batching and a fixed‑depth pipeline. The default depth is 16. It also employs a “pipeline rep” optimization where the leader can send multiple AppendEntries without waiting for responses, and responses are correlated using sequence numbers.

7.4 Benchmarking Results

A 2019 study by Intel and others compared various Raft implementations. A summary:

Naive Raft (no batching, no pipelining): < 1,000 writes/s.
Raft with batching only (B=64): ~15,000 writes/s.
Raft with batching and fixed pipeline (depth=10): ~35,000 writes/s.
Raft with adaptive pipeline (e.g., TiKV) on a three‑node cluster with SSDs: ~80,000 writes/s.

These numbers vary widely based on hardware; the key takeaway is that each optimization yields roughly an order‑of‑magnitude improvement.

8. Trade‑Offs, Pitfalls, and Advanced Considerations

While batching, pipelining, and adaptive concurrency are powerful, they come with challenges.

8.1 Memory Pressure

Larger batches and deeper pipelines increase memory usage on the leader. Each in‑flight batch occupies memory until acknowledged. If a follower is slow, the leader could accumulate many batches, leading to out‑of‑memory conditions. Implementations must impose an upper bound on total in‑flight bytes, which can be done by limiting the pipeline depth or by using a credit‑based system.

8.2 Re‑ordering and Client Semantics

With pipelining, entries within a batch are ordered, but different batches may be processed by followers in any order (as long as each follower processes them in the order they are received). The leader must still commit entries in index order. If a batch containing entry index 100 is acknowledged before a batch containing index 99 (due to network reordering), the leader cannot commit index 100 until index 99 is also committed. This can delay commit for later entries.

Mitigation: The leader delays committing until all prior indices are replicated. This is handled naturally by Raft’s commitIndex advancement.

8.3 Failures and Recovery

When a follower fails or a leader crashes, the in‑flight batches are lost. The new leader must reconstruct the pipeline state based on its own log and the matchIndexes it learns during election. Pipelined Raft needs to handle partial acknowledgments: if a follower had acknowledged batch N but not batch N+1, after a leader failure, the new leader may need to resend batch N+1. This is no different from standard Raft—the log consistency ensures correctness.

8.4 Leader Changes and Follower Lag

If a follower is lagging (e.g., because it was partitioned), the leader may accumulate a large number of batches for that follower. To prevent unlimited growth, many implementations throttle the rate for lagging followers by disabling pipelining for them temporarily.

8.5 Disk I/O Batching

Modern storage devices, especially NVMe SSDs, achieve peak throughput with large, contiguous writes. By batching multiple log entries into a single write, the full bandwidth of the disk can be used. Additionally, using direct I/O or persistent memory (like Intel Optane) can reduce fsync overhead. Some Raft implementations (e.g., PaxosStore, LogDevice) use a technique called log segmentation where multiple reads and writes are batched at the I/O layer.

8.6 NUMA Awareness and CPU Pinning

In high‑throughput settings, the leader’s CPU can become a bottleneck due to serialization, hashing, and network interrupt handling. Binding leader and follower threads to specific CPU cores (isolated from OS scheduling) can reduce context switches and improve cache locality. For instance, etcd can be run with taskset to pin it to a dedicated core.

8.7 Network Optimization

The network itself can be a bottleneck. Techniques include:

Zero‑copy serialization (e.g., using flatbuffers or Cap’n Proto).
TCP Zero Window avoidance by using larger socket buffers.
Remote Direct Memory Access (RDMA) for extremely low‑latency replication (used in some data center settings). RDMA can reduce RTT to microseconds, making disk I/O the dominant factor.

9. Analytical Modeling: Predicting Performance

To help system architects reason about trade‑offs, we can build a simple queueing model. Let:

λ = arrival rate of client writes (entries per second).
B = batch size (entries per batch).
μ = processing rate of a follower (batches per second).
P = pipeline depth (max number of in‑flight batches per follower).
RTT = network round‑trip time (including serialization and deserialization).
D = disk fsync latency per batch.

The leader’s batch builder produces batches at rate λ/B. The pipeline system can send at most P batches per follower outstanding. The time to process one batch on a follower is roughly D + RTT/2 (assuming half RTT for send, half for reply). The maximum throughput per follower (in batches/s) is 1 / (D + RTT/2). With three followers, the leader needs acknowledgments from two, so the system throughput is limited by the slowest of the two fastest followers.

If λ/B is less than the available per‑follower processing rate, the system is underutilized. If it exceeds the rate, back‑pressure builds up, and the pipeline will eventually fill, causing the batch builder to stall.

Using this model, one can simulate the effect of varying B and P. For example, with D=1 ms, RTT=2 ms, the per‑follower batch capacity is 500 batches/s. To achieve 50,000 entries/s, we need at least B=100 entries per batch. The pipeline depth must be at least (RTT + D) / batch_handler_time — but that’s already accounted for in the processing rate.

9.1 Practical Use of Models

Operators can run with a production cluster, measure D and RTT, and then choose B and P to meet their throughput SLAs. Many Raft implementations expose metrics for in‑flight latency and batch sizes, enabling iterative tuning.

10. Future Directions and Advanced Research

The optimization of Raft log replication is an active area. Several emerging trends:

10.1 Multi‑Raft and Partitioning

Instead of struggling to scale a single Raft group, many systems (e.g., etcd with v2->v3 sharding, TiKV with regions) partition the key space into multiple independent Raft groups. Each group can have its own leader, spreading the bottleneck. This is orthogonal to the pipeline optimizations discussed here—pipelining helps within each group.

10.2 In‑Memory Replication

For systems that can tolerate some durability risk (e.g., caching layers), Raft can be run without fsync, relying on replicated in‑memory logs. This eliminates disk I/O and allows extremely high throughput—up to millions of operations per second—with batching and pipelining. Examples: Redis Sentinel uses a simplified Raft with asynchronous replication; some key‑value stores implement “ephemeral” Raft.

10.3 Non‑Volatile Memory (NVM)

With the advent of NVM technologies like Intel Optane DCPMM, persistence latency drops to the microsecond range. Raft implementations can write log entries to NVM with near‑DRAM speed, reducing the disk bottleneck. However, the network and serialization overhead remain. The optimizations discussed (batching, pipelining) become even more critical as the imbalance between network and storage latency narrows.

10.4 Programmable Networks

SmartNICs and programmable switches can offload replication logic to the network fabric, performing in‑network aggregation of acknowledgments. This can reduce the number of round‑trips and offload the leader CPU.

10.5 Machine Learning for Adaptive Control

Some research explores using machine learning to predict follower capacity and dynamically adjust pipeline depth and batch size. This is still experimental but could outperform heuristic approaches in heterogeneous environments.

11. Conclusion

We began with a daunting problem: a Raft leader that can replicate only a few hundred entries per second due to serialism. We then unpacked three progressively powerful techniques—batching, pipelining, and adaptive pipelining (Pipelined Raft)—that together can push throughput to tens or even hundreds of thousands of writes per second.

Key takeaways:

Batching is the first step: group multiple entries into one RPC to amortize overhead. It yields near‑linear gains in throughput up to the point where batch accumulation latency becomes unacceptable.
Pipelining removes the serial dependency between batches, allowing the leader to keep both the network and followers saturated. With a fixed pipeline depth, performance improves, but adaptive (Pipelined Raft) offers robustness.
Pipelined Raft uses a TCP‑style congestion controller to adjust concurrency per follower, preventing overload while maximizing utilization.
These techniques complement each other; the best results come from combining batching, pipelining, and adaptive windowing.

In practice, adopting these optimizations requires careful engineering: handling memory pressure, flow control, and failure recovery. Yet the payoff is immense. Many of the world’s largest distributed systems—Kubernetes with etcd, TiDB with TiKV, Hashicorp’s ecosystem—rely on these very techniques to achieve the scale they do.

As you architect your own distributed database or message queue, remember that the answer to “How do I make Raft faster?” lies not in abandoning the algorithm but in leveraging its inherent parallelism through batching and pipelining. Understand your latency components—network, disk, serialization—and then design a pipeline that keeps them all busy. With these tools, you can turn Raft from a bottleneck into a high‑speed replication engine.

12. References

Ongaro, Diego, and John Ousterhout. “In search of an understandable consensus algorithm (extended version).” (2014).
Ongaro, Diego. “Pipelining in Raft.” Logcabin blog, 2018.
Howard, Heidi, et al. “Paxos vs. Raft: have we reached consensus on distributed consensus?” (2015).
Zhu, Yi, et al. “Understanding the performance of Raft in cloud environments.” (2019).
etcd documentation: “Performance optimizations”. GitHub, 2020.
TiKV documentation: “Raft pipeline”. PingCAP, 2021.
Intel. “Achieving high throughput with Raft on Intel Optane persistent memory.” Whitepaper, 2020.

If you enjoyed this deep dive, consider subscribing to our blog for more on distributed systems engineering. And if you’re building a high‑performance Raft‑based system, we’d love to hear about your experiences in the comments.