The Performance Of Asynchronous Replication Vs. Synchronous Replication In Distributed Databases

2024-04-07 · Leonardo Benicio

Summary

A comprehensive technical exploration of the performance of asynchronous replication vs. synchronous replication in distributed databases, covering key concepts, practical implementations, and real-world applications.

Contents

The Replication Crossroads: Synchronous vs. Asynchronous in Modern Databases

The coffee stain spreading across your hastily written note is a minor tragedy. The credit card transaction that fails, leaving you stranded at the airport car rental counter, is an inconvenience. But the bank transfer that posts twice, duplicating a $50,000 mortgage payment, or the medical record that vanishes moments before a critical surgery—these are system failures of a different magnitude. In the modern digital world, the line between a minor glitch and a catastrophic data event is drawn by a single, often invisible architectural choice: how your database replicates its data across multiple machines.

Every application you use—from the social media feed you scroll during breakfast to the financial dashboard you check before bed—relies on a database that is almost certainly not running on a single server. The era of the monolithic, single-node database is, for any application of scale, a relic. We live in a world of distributed systems, where data is sharded across continents and replicated across data centers to ensure it survives a server fire, a cloud region outage, or a sudden, massive spike in traffic.

But survival is not a simple binary state. It is a spectrum. And at the heart of this spectrum lies a fundamental tension that has haunted database architects, backend engineers, and site reliability engineers for decades: the choice between synchronous and asynchronous replication.

This choice is not a mere technical footnote. It is the very axis upon which the character of your system is defined. Do you want your database to be a steadfast, unyielding castle, where every piece of data is locked in stone and agreed upon before you move forward? Or do you want it to be a highly reactive, fast-moving courier service, accepting packages at a blistering pace while trusting they’ll eventually reach their destination? The former is safe, consistent, and slow. The latter is fast, scalable, and… risky.

Imagine, for a moment, that you are not a database administrator but the chief architect of a global payment system. Every second, thousands of transactions flow through your infrastructure. A user in Singapore buys a coffee, a factory in Shenzhen pays a supplier, a freelancer in Berlin receives payment from a client in New York. Your database must guarantee that no money is lost, no transaction is duplicated, and yet the system must respond in milliseconds. How do you choose between consistency and availability? How do you decide whether to wait for confirmation from a replica in another time zone or to forge ahead optimistically?

This blog post will take you deep into the heart of that decision. We will explore the mechanics of synchronous and asynchronous replication, dissect their strengths and weaknesses, and examine the real-world systems that embody each approach. We will look at the mathematical underpinnings of consensus, the horrors of split-brain scenarios, and the trade-offs that led to the creation of hybrid replication strategies. By the end, you will not only understand the difference between these two replication models but will be equipped to make architectural decisions that can mean the difference between a robust system and a catastrophic failure.

1. The Foundations: Why Replication Exists

Before we dive into the mechanics of synchronous vs. asynchronous replication, we must first understand why replication is necessary at all. The answer lies in three fundamental goals: fault tolerance, high availability, and scalability.

Fault Tolerance

Hardware fails. Disks crash, memory bits flip due to cosmic radiation, network cables get cut by backhoes, and entire data centers can be taken offline by hurricanes or human error. If your data exists on only one machine, the moment that machine dies, your data is gone. Replication mitigates this risk by storing copies of the data on multiple physical nodes. In the event of a failure, the system can continue operating using one of the remaining replicas.

High Availability

Even if a single server has high uptime (say 99.9%, which is about 8.76 hours of downtime per year), a distributed system with multiple replicas can achieve much higher availability. With three replicas distributed across independent failure domains, you can tolerate the loss of one or even two replicas without service interruption. This is the principle behind “five nines” (99.999%) uptime: you need a system designed to handle failures gracefully.

Scalability

Replication also allows you to scale read throughput. By distributing read requests among multiple replicas, you can serve far more concurrent users than a single node could handle. This is why large-scale web applications like Facebook, Twitter, and Amazon rely on multiple read replicas. However, scaling writes is more complex—that’s where the choice of replication strategy becomes critical.

The CAP Theorem and Its Influence

No discussion of replication is complete without reference to the CAP theorem, formulated by Eric Brewer in 2000. The theorem states that a distributed data store can simultaneously guarantee only two of the following three properties:

Consistency: Every read receives the most recent write or an error.
Availability: Every request receives a (non-error) response, without guarantee that it contains the most recent write.
Partition Tolerance: The system continues to operate despite arbitrary message loss or network failure between nodes.

In practice, network partitions (P) are unavoidable in distributed systems. Therefore, we are forced to choose between consistency (C) and availability (A) when a partition occurs. Synchronous replication tends to favor consistency (CP), while asynchronous replication leans toward availability (AP). However, as we shall see, the choice is not binary—many systems offer tunable consistency levels that navigate the CAP trade-offs.

2. Synchronous Replication: The Unyielding Castle

Synchronous replication is the architectural equivalent of a committee that insists on unanimous consent before taking any action. In a synchronous replication system, when a write operation is performed on the primary (or leader) node, it must be acknowledged by a specified number of replicas (typically all) before the write is considered successful and returned to the client.

How Synchronous Replication Works

Let’s walk through a typical synchronous write operation in a leader-based replication system:

Client sends a write request to the leader node (also called the primary or master).
The leader writes the data to its local storage (often to a write-ahead log, WAL) and simultaneously sends the write to all follower (or replica) nodes.
Each follower applies the write to its own local storage and sends an acknowledgment back to the leader.
The leader waits until it has received acknowledgments from a predefined quorum (e.g., all followers, or a majority).
Only then does the leader commit the transaction and return success to the client.

The critical point is step 4: the leader blocks until the required number of replicas have confirmed. This blocking behavior is what gives synchronous replication its consistency guarantee: if the leader crashes after step 5, any surviving follower that acknowledged the write will have the data, so the system can continue without loss.

Quorum and Majority

The concept of a quorum is central to synchronous replication. A quorum is the minimum number of nodes that must agree on a write (or read) for the operation to be considered valid. In a system with N nodes, a common choice is a majority quorum: floor(N/2) + 1. For example, with 3 nodes, a write quorum of 2 ensures that even if one node fails, the remaining two still have the latest data. This is the basis of the Paxos and Raft consensus algorithms.

Real-World Examples of Synchronous Replication

Google Spanner: Spanner uses synchronous replication via the Paxos consensus algorithm across data centers. Writes must be acknowledged by a majority of replicas in different zones. Spanner also leverages TrueTime, a GPS- and atomic-clock-based global time service, to provide external consistency (linearizability). The result is a globally distributed database that behaves like a single-machine database, but at the cost of increased latency (usually 10–50 ms for cross-region writes).
etcd and Consul: These distributed key-value stores are built on the Raft consensus algorithm, which uses synchronous replication. They are often used for coordinating configuration and service discovery in microservices architectures. A write to etcd is not considered successful until it is committed to a majority of nodes. This ensures strong consistency, which is critical for operations like leader elections and lock management.
MySQL Group Replication: MySQL offers a synchronous replication mode called Group Replication, which also uses Paxos-like consensus. All nodes in a group must agree on a transaction before it is committed. This provides strong consistency but limits throughput and increases latency compared to MySQL’s traditional asynchronous replication.

The Cost of Consistency: Latency and Throughput

The primary disadvantage of synchronous replication is performance. Because the leader must wait for acknowledgments from other nodes—often located in different data centers or even continents—the latency of a write operation can increase dramatically. Network round trips between data centers typically take tens to hundreds of milliseconds. If your application requires sub-millisecond write response times, synchronous replication may be infeasible.

Additionally, synchronous replication reduces write throughput because the leader can process only one write at a time while waiting for acknowledgments (unless it uses parallel replication, which adds complexity). In high-throughput systems like ad serving or real-time analytics, this bottleneck can be unacceptable.

The Danger of Blocking: Tail Latency and Availability

Synchronous replication also introduces a risk known as tail latency. If one of the followers is slow—due to a heavy load, a garbage collection pause, or a network hiccup—all write requests slow down to the speed of the slowest node. This can cascade into timeouts and retries, further degrading performance. In extreme cases, if a follower is completely unreachable, the leader may be unable to meet its quorum requirement, causing all writes to fail. This is the price of consistency under partition: the system becomes unavailable for writes rather than serving stale data.

Code Example: Simulating Synchronous Replication Logic

Below is a simplified pseudo-code representation of a synchronous write in a leader-based system with a quorum requirement:

class ReplicatedDatabase:
    def __init__(self, nodes, quorum_size):
        self.leader = nodes[0]
        self.followers = nodes[1:]
        self.quorum_size = quorum_size  # e.g., 2 out of 3

    def write_sync(self, key, value):
        # leader writes to its own log first
        self.leader.write_local(key, value)
        acks = 1  # leader counts as one acknowledgment

        # send write to followers
        for follower in self.followers:
            try:
                follower.write_local(key, value)
                acks += 1
            except Exception as e:
                # handle failure (maybe retry or mark follower as down)
                pass

        # wait for quorum
        while acks < self.quorum_size:
            # this could block indefinitely, but in practice use timeouts
            time.sleep(0.01)  # busy wait for illustration

        return "success"

In production, the implementation is far more sophisticated, involving persistent logs, timeout management, leader election, and crash recovery.

3. Asynchronous Replication: The Fast Courier Service

Asynchronous replication is the opposite of synchronous: the leader commits the write and returns success to the client without waiting for any replica acknowledgment. The data is then propagated to replicas in the background, typically through a log-based streaming mechanism.

How Asynchronous Replication Works

Client sends a write request to the leader.
The leader immediately applies the write to its local storage and commits the transaction.
The leader returns success to the client (often in milliseconds).
Sometime later, the leader sends the write to followers (or replicas) via a replication stream or log. The followers apply the write asynchronously.
If a follower fails to receive the update before the leader crashes, that write may be lost forever.

The key difference is that the client does not wait. The system trades consistency for availability and speed.

Real-World Examples of Asynchronous Replication

MySQL Asynchronous Replication: The classic MySQL replication setup (before Group Replication) uses asynchronous binary log replication. The master writes to its binary log, and a slave I/O thread pulls the log and applies events. If the master crashes before the slave receives the latest binary log entries, those writes are lost. This is why MySQL asynchronous replication is often used for read scaling, not for durability guarantees.
Cassandra (Default Mode): Cassandra offers tunable consistency, but its default replication is asynchronous. Writes are sent to a coordinator node, which then forwards the writes to replicas. The client can choose to wait for a specified number of acknowledgments (e.g., ONE, QUORUM, ALL). With a low consistency level like ONE, the write is considered successful after just one node acknowledges, making it effectively asynchronous.
Apache Kafka: Kafka uses asynchronous replication between brokers. Producers can choose an acks parameter: acks=0 (fire-and-forget), acks=1 (leader acknowledges), or acks=all (all in-sync replicas acknowledge). Even with acks=all, Kafka’s implementation is not fully synchronous because the leader does not wait for all replicas to flush to disk—it only waits for them to be in memory. This provides a balance.
DynamoDB (Default): Amazon DynamoDB replicates data asynchronously across three Availability Zones (AZs). A write is considered durable once it is committed in one AZ; replication to other AZs happens in the background. This gives DynamoDB extremely low write latency (single-digit milliseconds), but under rare failure conditions, a write could be lost if the primary AZ fails before replication completes.

The Benefits: Speed, Scalability, and Availability

Asynchronous replication is the engine behind many of the world’s fastest databases. By eliminating the round-trip wait, applications can achieve write latencies in the sub-millisecond range. This is essential for real-time use cases like gaming leaderboards, chat messaging, sensor data ingestion, and high-frequency trading (though trading often requires stronger consistency).

In addition, asynchronous replication improves write throughput because the leader can process writes without blocking on slow replicas. It also increases availability: if a replica fails, the leader continues to serve writes normally. Only when the leader itself fails do we see potential data loss.

The Price of Speed: Data Loss and Stale Reads

The most significant drawback of asynchronous replication is the potential for data loss. If the leader crashes before it has sent the latest writes to any follower, those writes are gone. The period of vulnerability is often called the replication lag. In a typical MySQL setup, the replication lag might be a few milliseconds to seconds. In a heavily loaded system, it can be minutes. For financial transactions, even a few milliseconds of data loss could be catastrophic.

Additionally, asynchronous replication leads to stale reads. If you read from a follower that has not yet received the latest write, you see an old version of the data. This can cause user-visible anomalies: you update your profile picture and see the old one for a few seconds, or you make a purchase and the inventory count appears unchanged until the replication catches up. Applications that rely on strong consistency (e.g., inventory management, account balances) cannot tolerate such staleness.

Code Example: Asynchronous Replication with a Replication Queue

A simplified asynchronous replication might look like this:

class AsyncReplicatedDatabase:
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self.replication_queue = []  # pending writes to send

    def write_async(self, key, value):
        # leader commits immediately
        self.leader.write_local(key, value)
        # add to queue for background replication
        self.replication_queue.append((key, value))
        return "success"

    def background_replicate(self):
        # called periodically or in a separate thread
        while True:
            if self.replication_queue:
                key, value = self.replication_queue.pop(0)
                for follower in self.followers:
                    try:
                        follower.write_local(key, value)
                    except Exception as e:
                        # re-queue for retry? or log error
                        # if follower permanently down, may need to re-sync
                        pass
            time.sleep(0.001)  # yield control

In real systems, the replication queue is replaced by a persistent log (e.g., WAL or binlog) that allows replicas to catch up even if the leader restarts.

4. The Spectrum of Trade-offs: Semi-Synchronous, Quorum, and Hybrid Approaches

The binary choice between synchronous and asynchronous is often too coarse for real-world requirements. Many databases offer intermediate consistency levels that allow architects to fine-tune the trade-off between durability, latency, and availability.

Semi-Synchronous Replication

Semi-synchronous replication is a compromise: the leader waits for at least one replica to acknowledge the write before returning to the client. This provides a durability guarantee that at least one copy of the data exists on a different node, but the other replicas may be behind. This is significantly faster than full synchronous replication because you only wait for the fastest replica, not the slowest.

MySQL has a semi-synchronous plugin: after committing the transaction, the master pauses until at least one slave acknowledges that it has received and applied the event. If no slave acknowledges within a timeout, the master falls back to asynchronous mode. This reduces the window of potential data loss to a single replica failure.

Quorum-Based Approaches (Dynamo-Style)

Amazon Dynamo and its descendants (Cassandra, Riak, Voldemort) introduced the notion of tunable consistency. The client can specify:

W = number of replicas that must acknowledge a write.
R = number of replicas that must respond to a read.

By choosing W + R > N (where N is the replication factor), you can achieve quorum-based strong consistency without requiring all replicas to be synchronous. For example, for N=3, a write with W=2 and read with R=2 ensures that at least one overlapping replica has the latest data.

In this model, the replication is asynchronous between nodes, but the client waits for a quorum of acknowledgments. This provides a tunable span on the consistency-latency curve. Many NoSQL databases allow you to set W=1 for fast writes (essentially asynchronous) or W=ALL (like synchronous) at the cost of performance.

Hybrid Replication in Spanner and CockroachDB

Google Spanner and its open-source counterpart CockroachDB use a sophisticated hybrid approach:

They use synchronous replication via consensus (Paxos/Raft) for the metadata and transaction logs, ensuring that commit decisions are durable.
Data itself can be replicated asynchronously across regions for read performance, but the transaction commit requires a quorum in the region where the data resides.
Additionally, they use clock synchronization (TrueTime in Spanner, HLC in CockroachDB) to provide efficient read-only transactions that can avoid blocking on replication.

This approach allows them to achieve global strong consistency with acceptable latency for most use cases (tens of milliseconds), which is far better than full synchronous replication across continents without clever optimizations.

Trade-off Table

Approach	Durability	Write Latency	Read Consistency	Availability under Partition
Fully Synchronous	Highest (no loss unless all replicas fail)	High (wait for all/slowest)	Strong (linearizable)	Low (writes may be unavailable if quorum lost)
Semi-Synchronous	High (one replica confirmed)	Moderate (wait for fastest replica)	Strong (leader reads)	Moderate (fallback to async)
Quorum (W < N)	Moderate (W replicas)	Low to Moderate (depends on W)	Eventual or Strong (if R+W>N)	High (can tune)
Fully Asynchronous	Low (potential loss of recent writes)	Very Low	Eventual	Very High

5. Real-World Case Studies: When Replication Choices Matter

Case 1: GitHub’s MySQL Outage (2018)

In October 2018, GitHub experienced a major outage that lasted about 24 hours. The root cause was a failure in their MySQL asynchronous replication topology. A network partition caused the replication lag to grow uncontrollably, and when the partition healed, the replicas attempted to replay a huge backlog of binary logs, overwhelming the database and leading to cascading failures.

GitHub’s architecture relied on asynchronous replication for scaling reads and for disaster recovery across data centers. The incident highlighted the dangers of assuming asynchronous replication will always keep up. If they had used synchronous replication, the system might have remained consistent but could have become unavailable earlier. The lesson: monitor replication lag aggressively and have mechanisms to shed load or throttle writes.

Case 2: Amazon DynamoDB and the 2017 S3 Outage

In February 2017, Amazon S3 (Simple Storage Service) experienced a severe outage in the US-EAST-1 region. The root cause was a typo in a command that took down more subsystems than intended. While S3 is not a database per se, its replication model is similar to DynamoDB’s: asynchronous replication across Availability Zones (AZs). During the outage, it was discovered that some objects stored shortly before the failure were replicated to only one AZ and were inaccessible for several hours.

Amazon later introduced S3 Object Lambda and stronger replication guarantees (S3 Replication Time Control and S3 Batch Replication) to allow customers to enforce synchronous replication for critical data. This case illustrates that even the largest cloud providers must carefully design replication strategies for different durability tiers.

Case 3: Financial Systems and the Need for Synchronous Replication

In the world of banking and financial exchanges, data loss is not an option. The SWIFT network, for example, requires that a payment message is logged on at least two physically separate systems before it is considered processed. Many core banking systems use synchronous replication between primary and backup data centers, often through specialized hardware (e.g., EMC VPLEX or IBM GDPS). Latency is higher—sometimes 50-100 ms for cross-country replication—but that is acceptable when moving millions of dollars.

In high-frequency trading (HFT), however, even that latency is too high. Trading firms often use co-location and write to a single node in memory, then asynchronously replicate to a slower disk. If the node crashes, they lose the last few microseconds of trades, but statistical risk models account for such losses. The trade-off is extreme: they prioritize speed over perfect durability.

6. The Future: Strong Consistency at Global Scale

For decades, the conventional wisdom was that you could either have strong consistency or high performance, not both, especially across geographic distances. However, recent advances are changing this:

Cloud Spanner and CockroachDB show that with careful engineering (clock synchronization, consensus optimizations, and intelligent partitioning), you can achieve linearizable consistency with latencies in the tens of milliseconds across continents.
CRDTs (Conflict-Free Replicated Data Types) allow asynchronous replication without data loss, as long as the data structure is commutative and idempotent. Applications like collaborative editing (Google Docs) and multi-user databases (Redis CRDTs) use CRDTs to provide eventual consistency that converges automatically.
Tiered Storage and Hybrids: Systems are beginning to layer fast synchronous replication in a single region with asynchronous replication across regions. For example, Couchbase and MongoDB support multi-document transactions with read concern “linearizable” for a single region, but cross-region replication is asynchronous.

The choice between synchronous and asynchronous replication is no longer a binary decision. Modern databases offer fine-grained controls, allowing you to set consistency requirements per query or per table. The architect’s job is to understand the business requirements: what data is critical, what latency budget is acceptable, and what failure scenarios must be tolerated.

7. Conclusion: The Castle and the Courier

We began with the image of a steadfast castle and a fast-moving courier. In truth, real-world systems must be both. The castle’s walls protect the vault; the courier delivers the mail. A bank needs synchronous replication for its ledgers but might accept asynchronous replication for its customer-facing profile pictures.

The replication decision is not just about technology—it is about understanding the value of your data. A lost tweet is a minor annoyance; a lost medical record can cost a life. As you design or evaluate distributed systems, ask these questions:

What is the business cost of data loss?
What is the business cost of stale reads?
What are the latency SLAs for writes and reads?
How frequently do network partitions or node failures occur in my deployment?
Can we afford the operational complexity of consensus algorithms?

There is no one-size-fits-all answer. The spectrum of replication strategies—synchronous, asynchronous, semi-synchronous, quorum-based, hybrid—offers a palette for architects to paint the system that best meets their constraints.

Ultimately, the choice between synchronous and asynchronous replication is a choice about trust. Do you trust that your network will always be perfect? That your replicas will always be fast? That your leadership will survive? The wise architect trusts nothing and builds accordingly. Whether you build a castle or hire a courier, ensure that your system’s replication strategy aligns with the promises you make to your users. Because in the digital world, a minor glitch can quickly become a catastrophe—and the line is drawn by the replication model you choose.