Analyzing The Performance Of Two Phase Commit (2Pc) And Three Phase Commit (3Pc) In Geo Distributed Systems

2019-03-27 · Leonardo Benicio

Summary

A comprehensive technical exploration of analyzing the performance of two phase commit (2pc) and three phase commit (3pc) in geo distributed systems, covering key concepts, practical implementations, and real-world applications.

Contents

Here is the complete, expanded blog post. I have taken your excellent introduction and built upon it, diving deep into each section to reach the required length and depth, adding detailed examples, pseudocode, failure scenarios, and a rigorous performance analysis.

Title: The Cost of Consensus: Analyzing 2PC vs. 3PC Performance in a World Without Walls

Introduction

Imagine, for a moment, you’re booking a flight. You open your favorite app, search for a route from New York to Singapore, and find a perfect itinerary. You click “Book.” In that single, seemingly simple moment, a digital earthquake must occur. Your request hits a web server in Virginia, which must deduct $850 from your bank account—a bank whose primary database lives in a vault in Frankfurt. Simultaneously, it must reserve that specific seat on the Singapore Airlines flight, a record locked in a database in a data center in Tokyo. Finally, it must add your name to the passenger manifest, a table stored on a cluster in Sydney.

The application must either ensure that all four things happen (money is taken, seat is reserved, manifest is updated) or that none of them happen. You cannot have your money taken but the seat double-booked. You cannot have the flight confirmed but no record of payment. This is the fundamental problem of distributed systems: atomic commitment.

In the cozy, low-latency world of a single data center, this problem is solved with reasonable efficiency. We’ve built decades of engineering wisdom on top of protocols like Two-Phase Commit (2PC). But the modern internet is not single-datacenter. It is global, sprawling, and merciless. The latency between New York and Singapore is not 1 millisecond; it’s the better part of 200 milliseconds. The network is not a reliable, private LAN; it’s the open, unpredictable, and lossy public internet.

We are living in the era of geo-distributed systems. Global SaaS applications, financial trading networks, multiplayer gaming platforms, and multi-cloud data fabrics all demand that operations span continents. This shift from a local, reliable network to a global, unreliable one does not just make our protocols slower; it fundamentally breaks some of their core assumptions. The blockades are not physical walls between data centers, but walls of latency, the risk of network partitions, and the terrifying specter of a coordinator crashing mid-protocol.

The classic solution to atomic commitment, taught in every distributed systems course, is Two-Phase Commit (2PC). Its younger, more sophisticated sibling, Three-Phase Commit (3PC), emerged to solve 2PC’s most critical flaw: its vulnerability to blocking when the coordinator fails. The textbook narrative is clear: 2PC is simpler but blocking; 3PC is more complex but non-blocking.

But is this binary distinction sufficient for the modern geo-distributed world? The cost of “consensus” is not paid in lines of code alone. It is paid in milliseconds, in throughput, in user experience. A protocol that is theoretically non-blocking can be practically unusable if its performance characteristics are catastrophic in a high-latency, parti- prone environment.

This post is a deep dive into that cost. We will move beyond the textbook orthodoxy to perform a rigorous, performance-oriented analysis of 2PC and 3PC. We will dissect their messaging overhead, their blocking behavior not just as a theoretical failure-mode but as a performance degradation under stress, and their susceptibility to the brutal physics of the speed of light. By the end, you will have a nuanced understanding that a “better” protocol is not always the right one, and that in the real world, the cost of consensus often dictates the architecture.

Part 1: The Protocols in Depth – A Performance Lens

Before we can analyze performance, we must have a crystal-clear mental model of how each protocol works, focusing specifically on the steps that consume resources: messages, timeouts, and synchronization points.

1.1 Two-Phase Commit (2PC): The Workhorse and Its Chain

2PC is elegant in its simplicity. It introduces a coordinator and a set of participants. The goal is for all participants to agree on the final outcome: commit or abort. The protocol proceeds in two distinct phases.

Phase 1: Voting (or Prepare)

The coordinator sends a PREPARE message (often called a “VOTE-REQUEST”) to all participants. This message includes the transaction ID and the data to be committed.
Upon receiving PREPARE, each participant does the following:
- It checks if it can execute the transaction (e.g., does the account have enough funds? Is the seat available?).
- It writes the transaction’s changes and its own status (READY or ABORT) to a durable log (a file on disk). This is the “prepare” step. The participant is now in a “READY” state, uncertain of the final outcome.
- It sends a message back to the coordinator: either a YES (ready, prepared to commit) or a NO (cannot commit, wants to abort).

Phase 2: Decision (or Commit/Abort)

The coordinator collects all responses.
- SCENARIO A (Commit): If the coordinator receives a YES from all participants, it decides to commit. It writes a COMMIT decision to its own durable log.
- SCENARIO B (Abort): If the coordinator receives even a single NO, or if it times out waiting for a response, it decides to abort. It writes an ABORT decision to its log.
The coordinator then sends the final decision (either a COMMIT or ABORT message) to all participants.
Upon receiving the decision, each participant updates its own state (commits or aborts the transaction), writes the final outcome to its durable log, and sends an ACK back to the coordinator.
Once the coordinator has received ACKs from all participants, it considers the transaction complete and can forget about it.

Performance Bottlenecks of 2PC:

Synchronous I/O: The PREPARE phase requires a durable log write on every participant. This is one of the most expensive operations in a database. In a geo-distributed system, this write happens on a disk in Sydney, Frankfurt, Tokyo, etc., introducing massive latency at the transaction’s very first step.
Blocking on the Coordinator: The participants are in a “READY” state, holding locks on resources (e.g., the seat, the money). They are completely blocked, waiting for the coordinator’s decision. If the coordinator crashes after Phase 1, the participants are stuck in this “READY” state for an indefinite period. They cannot unilaterally decide to commit or abort because they don’t know the coordinator’s final decision. They don’t know if other participants voted NO and the coordinator was about to send an ABORT. This is the blocking problem. They will hold those locks until the coordinator comes back, or a human operator manually intervenes.
Network Rounds vs. Latency: The protocol requires a minimum of two network round trips (PREPARE/YES, then DECISION/ACK). In a geo-distributed system, each round trip can take 100ms-300ms. This immediately sets a floor on the transaction’s latency.

1.2 Three-Phase Commit (3PC): The Non-Blocking Promise

3PC was explicitly designed to solve the blocking problem of 2PC. It introduces a third, intermediate “pre-commit” phase to break the chain. The key insight is that if a participant has enough information, it can unilaterally decide the outcome even if the coordinator fails.

Phase 1: Can-Commit

This is identical to 2PC’s Phase 1.

Coordinator sends CAN-COMMIT? to all participants.
Participants check viability, log their state, and respond with YES or NO.
If any participant responds NO, the coordinator aborts immediately.

Phase 2: Pre-Commit

This is the new, critical phase.

If all participants voted YES, the coordinator knows it is safe to eventually commit. It sends a PREPARE-TO-COMMIT (often just called PRE-COMMIT) message to all participants.
Crucially, the coordinator writes this decision to its log. This is the point of no return for the coordinator.
Upon receiving PREPARE-TO-COMMIT, each participant writes this decision to its own durable log. It is now in a “PRE-COMMIT” state.
Each participant sends an ACK back to the coordinator.

Phase 3: Do-Commit

This is the actual commit phase.

Once the coordinator has received ACKs for the PREPARE-TO-COMMIT from all participants, it sends a DO-COMMIT message.
Participants receive DO-COMMIT, write the final commit to their log, and send a final ACK.

The Non-Blocking Mechanism:

The magic is in the PRE-COMMIT phase. The PREPARE-TO-COMMIT message acts as a guarantee. It tells all participants: “I (the coordinator) have decided that we will commit. This is a done deal.”

Now, consider a coordinator crash after it has sent PREPARE-TO-COMMIT but before it sends DO-COMMIT. A participant that is stuck in the PRE-COMMIT state can recover by consulting its peers.

The Protocol Rule: A participant in the PRE-COMMIT state knows that the final decision is to commit.
The Recovery: If a participant times out waiting for the DO-COMMIT from a crashed coordinator, it queries the other participants. If any other participant is also in the PRE-COMMIT state (or has already committed), the protocol can agree to commit. This is because they know the coordinator had made the decision to commit before it died. If no participant is in PRE-COMMIT, they know the coordinator must have aborted (or never started Phase 2), and they can safely abort.

Performance Costs of 3PC:

Extra Round Trip: 3PC requires three network round trips (CAN-COMMIT/YES, PREPARE-TO-COMMIT/ACK, DO-COMMIT/ACK), compared to 2PC’s two. In a geo-distributed system, this extra 100-300ms is a massive performance tax.
More Log Writes: The coordinator must write its decision to its log in Phase 2, and participants must write their PRE-COMMIT state. More durable writes mean more I/O overhead.
Complexity Overhead: The recovery protocol is significantly more complex. A participant cannot just wait for the coordinator; it must initiate a peer-to-peer consensus protocol to resolve the uncertainty. This adds latency and overhead during failure recovery.

Part 2: A Deeper Performance Analysis

Let’s now build a model to quantify these performance differences.

2.1 Latency vs. Timeout: The Geo-Distributed Trap

The most common argument against 3PC in geo-distributed systems is its latency. A single transaction with 2PC has a minimum latency of 2 * (max latency) plus processing time. With 3PC, it’s 3 * (max latency). This seems like a clear loss for 3PC.

But the story is more nuanced. The biggest practical problem in geo-distributed systems is not the average latency, but the tail latency and the handling of timeouts.

2PC’s Timeout Problem: In 2PC, the coordinator sets a timeout for the Phase 1 responses. If a participant in Tokyo is slow (a network hiccup adds 500ms), the coordinator in Virginia might abort the entire transaction. This abort is correct, but it’s a performance failure – a long wait followed by a negative outcome. The transaction must be retried, doubling the latency.
3PC’s Timeout Problem: In 3PC, the coordinator also has timeouts. However, the critical recovery happens during the participant’s timeout. If a participant in Tokyo times out waiting for the DO-COMMIT, it can contact its peers. This peer-querying adds another round trip during failure. The recovery latency can be catastrophic.

The Hidden Cost of 2PC’s Blocking:

The blocking problem of 2PC is often dismissed as a failure-mode that requires manual intervention. This is wrong. The blocking problem is a performance problem, especially under stress.

Imagine a scenario where the coordinator crashes temporarily (e.g., a 30-second restart).

2PC: All participants that are in the “READY” state are now blocked. They hold locks on critical database resources (e.g., a table row for a popular item, a seat on a flight). Any other transaction that needs to access those resources must now wait. This creates a cascade of blocked transactions, effectively halting the system’s throughput on those resources. The system’s performance doesn’t just degrade to 0; it degrades to a negative, requiring manual intervention to clear deadlocks.
3PC: Participants in the “PRE-COMMIT” state can recover. They can query each other, re-form a consensus, and decide to commit (or abort). The transaction completes in seconds, not minutes. The locks are released. The system’s throughput remains high. The failure is handled gracefully.

Table 1: Performance Characteristics Under Failure

Protocol	Latency (Normal)	Latency (Failure/Bottleneck)	Recovery Performance	Impact on Throughput
2PC	`2 * RTT`	Abort + Retry -> `4 * RTT`	Blocked (Indefinite)	Degrades to 0 on resource
3PC	`3 * RTT`	Delayed + Peer Query -> `4 * RTT`	Automatic (Fast)	Slight, temporary dip

2.2 The Cost of Uncertainty: A Detailed Example with Pseudocode

Let’s analyze the critical “No Knowledge” state in 2PC. This is the state a participant is in after sending YES but before receiving the decision. It is an indefinite state of uncertainty.

Pseudocode for 2PC Participant (the problem):

def handle_prepare(transaction):
    if can_perform_transaction(transaction):
        write_to_durable_log('PREPARED', transaction.id)
        send_to_coordinator('YES', transaction.id)
        # The participant is now in a state of "No Knowledge"
        # It is holding locks, waiting for a COMMIT or ABORT from the coordinator.
        # IT CANNOT DO ANYTHING UNTIL IT HEARS FROM THE COORDINATOR.
        wait_for_decision(transaction.id) # This could block forever on a crash
        if decision == 'COMMIT':
            commit_changes(transaction)
        else:
            abort_changes(transaction)
        release_locks(transaction)
    else:
        send_to_coordinator('NO', transaction.id)
        abort_changes(transaction)

The Performance Implication: The wait_for_decision function is a blocking call. It holds resources hostage. In a busy system with thousands of transactions per second, even a single coordinator crash can cause a massive pile-up. The system’s performance degrades as a power function of the number of blocked transactions.

How 3PC Avoids This:

# 3PC Participant
def handle_can_commit(transaction):
    if can_perform_transaction(transaction):
        write_to_durable_log('READY', transaction.id)
        send_to_coordinator('YES', transaction.id)
        # Participant is now in "No Knowledge" (same as 2PC)
    else:
        send_to_coordinator('NO', transaction.id)

def handle_pre_commit(transaction):
    # This message confirms that the coordinator has decided to commit.
    write_to_durable_log('PRE_COMMIT', transaction.id)
    send_to_coordinator('ACK')
    set_timeout_for_do_commit(transaction.id, timeout_value)

def handle_timeout(timeout_id):
    # The coordinator might be dead.
    # Initiate the recovery protocol.
    for peer in all_participants:
        if peer == self:
            continue
        state = ask_peer_for_state(peer, transaction.id)
        if state == 'PRE_COMMIT' or state == 'COMMITTED':
            # We know the coordinator wanted to commit.
            commit_changes(transaction)
            release_locks(transaction)
            return
    # No peer is in PRE_COMMIT, so the coordinator must have aborted.
    abort_changes(transaction)
    release_locks(transaction)

The performance gain is clear. A timeout in 3PC doesn’t mean indefinite waiting; it triggers a deterministic recovery procedure that is bounded in time by the timeout_value + (2 * RTT).

2.3 The Isolation Problem: Why Locks Hurt More in a Distributed World

Locks are the enemy of performance in any system, but in a distributed system, they are a catastrophe. Both 2PC and 3PC typically rely on two-phase locking (2PL) for isolation. This means that a participant must acquire all its locks before it can vote YES.

Consider the flight booking example.

Participant A (Frankfurt Bank): Locks the user’s bank account.
Participant B (Tokyo Airlines): Locks the specific seat.
Participant C (Sydney Manifest): Locks the passenger manifest row.

In a single data center, these locks are held for tens of milliseconds. In a geo-distributed system, they are held for the duration of the entire consensus protocol—potentially hundreds of milliseconds or more. This dramatically increases the contention window.

The effect on performance:

2PC: The lock is held from the moment the participant receives PREPARE until it receives COMMIT/ABORT. This is ~1 * RTT (for the vote) + ~1 * RTT (for the decision) + processing. If a user tries to book the same seat from a different app, that second request will be blocked for this entire time. This leads to high contention and poor throughput.
3PC: The lock is held for one extra RTT during the PRE-COMMIT phase. The lock duration is ~1 * RTT + ~1 * RTT + ~1 * RTT. This is 50% longer than 2PC, making it significantly worse for workloads with high contention.

Summary of Performance Trade-offs:

Characteristic	2PC	3PC
Normal-Case Latency	Lower (2 RTT)	Higher (3 RTT)
Contention Window	Shorter	Longer
Failure Recovery Latency	Indefinite (Blocking)	Bounded (Fast)
Failure Impact on Throughput	Catastrophic (0 to negative)	Manageable (Temporary dip)
Network Overhead	Lower (2 N messages)	Higher (3 N messages)
Complexity (Code)	Simpler	More Complex
Complexity (Debugging)	Simpler (but blocked state is hard to fix)	Complex (recovery protocol is hard to get right)

Part 3: Modern Implementations and Their Performance Realities

The textbook protocols are rarely implemented verbatim. Real-world systems make pragmatic choices.

3.1 MySQL Group Replication (GR)

MySQL Group Replication (GR) is a popular solution for high-availability in a single cluster. It is not a pure 2PC or 3PC protocol. It uses a Paxos-based consensus (via its XCOM protocol) to agree on the order of writes.

How it Works: When a server wants to commit a transaction, it broadcasts the write and a PREPARE message to the group. The group uses Paxos to reach a consensus on which transaction to commit next. Once the consensus is reached, all nodes are told to commit.
Performance Analogy: This is like a very fast, efficient 3PC. The PREPARE (Can-Commit) and the PREPARE-TO-COMMIT (the Paxos consensus) are effectively combined.
Geo-Distributed Reality: In a single data center, GR is fast. Across continents, the latency of the Paxos rounds (which require multiple message passes) becomes the bottleneck. It struggles with high-latency links. This is why GR is typically deployed within a single region.

3.2 Google Spanner and TrueTime

Google Spanner is the gold standard for geo-distributed transactions. It famously uses TrueTime, a hardware-assisted global clock, to provide external consistency.

How it Works: Spanner does not use 2PC or 3PC in the traditional sense. It uses a Paxos-based system for each shard, and then a Percolator-like protocol (based on 2PC) to coordinate transactions across multiple shards. The key difference is that TrueTime provides a precise timestamp, t, for each transaction.
Performance Advantage: When a participant in Spinner’s 2PC votes YES, it includes a timestamp. The schema for the commit/abort decision is: “Only commit if the timestamp t is in the past and all participants voted YES.” This eliminates the indefinite blocking problem of 2PC in a clever way. If the coordinator crashes, a participant can use its TrueTime clocks to unilaterally decide to commit or abort after a certain timeout. This is a non-blocking 2PC implementation.
The TrueTime Cost: The price is a requirement for a complex, expensive, and specialized hardware (atomic clocks and GPS receivers in every data center). This is not an option for most businesses.

3.3 Practical Examples: When to Choose What

Scenario 1: Global Inventory Management (Low Contention, High Latency)
- You have a system that updates inventory for a small number of items (e.g., luxury goods) across 3 continents. Contention is low. Latency is the primary user-facing metric.
- Winner: 2PC. The 50% lower latency in the normal case is more important than the rare, catastrophic blocking failure. You can design your system to tolerate blocking (e.g., by having a crash-recovery mechanism that automatically commits prepared transactions).
Scenario 2: High-Frequency Trading (Extreme Contention, Low Latency)
- You need to update a shared ledger for a popular stock. Contention is incredibly high. Locks must be held for the absolute minimum time.
- Winner: 2PC (with optimistic concurrency control). 2PC is faster per round trip. Furthermore, you might avoid locking altogether by using a different concurrency control mechanism (e.g., optimistic concurrency control), reducing the impact of the locking problem.
Scenario 3: Financial Settlement System (High reliability, Moderate Latency)
- You must guarantee that a payment is either fully settled or fully rejected. A blocked state lasting minutes or requiring manual intervention is a nightmare that can lead to financial loss and regulatory problems.
- Winner: 3PC (or a Paxos-based system). The ability to automatically recover from failures, even if it is a bit slower, is a strict requirement. The non-blocking property is not a nice-to-have; it is a business necessity.
Scenario 4: Global SaaS Multi-Tenant Database (Mixed Workload)
- You have a complex application with thousands of users, many transactions, and a mix of read/write loads. You need high throughput and low latency.
- Winner: Neither (use a different architecture). The performance cost of any atomic commit protocol is too high for a general-purpose workload. Instead, you should design your data model to minimize cross-shard transactions. Use data locality (keeping all related data in a single region) and compensating transactions or sagas for operations that must span regions.

Conclusion: The Real Cost of Consensus

We have journeyed beyond the textbook. The simple story—2PC is blocking, 3PC is non-blocking—is a dangerous oversimplification for the geo-distributed world.

The real cost of consensus is not just the number of rounds; it is the interplay of latency, contention, and failure recovery.

2PC is the performance king in the normal case, but it is a fragile king. Its blocking nature under failure is a performance catastrophe waiting to happen. It is the right choice when failures are rare, contention is low, and you can afford the occasional, manual recovery operation.
3PC is the robust, reliable workhorse, but it is slower. Its non-blocking property is a performance safety net, preventing minor failures from cascading into system-wide outages. It is the right choice when failure recovery speed is paramount, and you can tolerate the extra latency in most transactions.

The most performant system, however, is often the one that avoids the problem altogether. The rise of modern architectures like CQRS (Command Query Responsibility Segregation) and Event Sourcing is a direct response to this cost of consensus. By designing systems that operate on events and eventual consistency, we can achieve a global scale without the crippling overhead of synchronous atomic commitment.

Ultimately, the choice between 2PC and 3PC is a fundamental architectural decision. It is a bet you place on the reliability of your infrastructure versus the patience of your users and the speed of light. There is no one-size-fits-all answer. The true mark of a skilled engineer is not knowing which protocol is “better,” but understanding the trade-offs and choosing the right cost to pay for the consensus you need.