The Realities Of Implementing A Distributed Transaction Protocol: Percolator Vs. Spanner

A comprehensive technical exploration of the realities of implementing a distributed transaction protocol: percolator vs. spanner, covering key concepts, practical implementations, and real-world applications.
Contents
The Reality of Distributed Transactions: Percolator vs. Spanner
Part 1: The Lie We Tell Ourselves
It begins, as it so often does in distributed systems, with a lie.
We tell ourselves that the database is a single, reliable, monolithic entity. We imagine a fortress of data, safe behind a single set of walls, where updates are atomic, consistent, isolated, and durable. This is the comfortable, ACID‑compliant myth we carry from the era of single‑machine databases. It is a beautiful lie, but a lie nonetheless.
The reality of modern scale is that the fortress has been razed and rebuilt as a sprawling, decentralized city. Your data doesn’t live in one place. It lives in a thousand, spread across racks, across data centers, across continents. The single machine’s lock manager, its fail‑safe transaction log, its reliable clock—these foundational pillars of the classic database are gone. In their place, we have a chaotic, asynchronous, unreliable substrate: commodity servers connected by a network that can partition, reorder, or lose messages, and governed by clocks that drift and disagree.
And yet, we still demand the lie. We demand the ability to transfer $100 from Account A to Account B, and we demand that this operation either happens completely or not at all, that we never see the money disappear from A without appearing in B, and that a concurrent read sees a consistent snapshot of both accounts. This demand, in the face of this chaotic reality, is the fundamental challenge of the distributed transaction.
For years, the standard answer to this challenge was to simply give up on the ideal. The industry settled for a weaker consistency model—often eventual consistency—trading atomicity for availability and performance. This works for a surprising number of applications (your social media feed doesn’t need perfect transactional integrity), but it is a complete non‑starter for the lifeblood of the modern digital economy: finance, inventory management, booking systems, and any application where one inconsistent read could lead to double charges, negative inventory, or corrupt bank balances.
Then came two landmark systems from Google that dared to chase the lie. Percolator (circa 2010) and Spanner (circa 2012) both offered to bring ACID transactions to the planet‑scale distributed world, but they took radically different paths. Percolator—a pragmatic hack that layered transactions on top of Bigtable’s eventual consistency—was a brilliant, clever hack that paid for its transparency with latency and complexity. Spanner—a ground‑up redesign of the database itself—introduced the radical idea of using GPS and atomic clocks to achieve external consistency across the globe.
Understanding the realities of implementing these two systems—not just the white‑paper theory, but the gritty, real‑world compromises—is essential for any engineer who must choose between them, or who wants to understand what distributed transactions actually cost.
In this post, we will tear apart both systems. We will look at their architecture, their protocols, their failure modes, and the hidden engineering nightmares that the papers gloss over. We will explore the trade‑offs that were made, the assumptions that were baked in, and the lessons that apply to any distributed system you might design or deploy today.
By the end, you will understand why there is no free lunch in distributed transactions, and why the choice between Percolator and Spanner is ultimately a choice between two different kinds of suffering.
Part 2: The Dream of Distributed Transactions
Before diving into the specifics, we need a common language. What do we mean by a “distributed transaction”? And why is it so hard?
A transaction, in the classical sense, provides four properties—atomicity, consistency, isolation, durability—that we all know as ACID. In a single‑machine system, these are enforced by the database engine using locks, write‑ahead logs, and a single clock. When we distribute the data across many machines, each running its own database engine, we lose the ability to coordinate these properties with a single point of control.
The core difficulty is that any operation that touches multiple machines must now deal with three fundamental enemies:
- Partial failures: The network can partition; a machine can crash mid‑transaction. We must handle the case where one participant commits and another fails.
- Clock skew: Two machines’ clocks can disagree by milliseconds, seconds, or even minutes in the worst case. This makes it nearly impossible to order events in a globally consistent way.
- Network delay: Messages can be delayed, reordered, or lost. We cannot assume a synchronous round‑trip.
The holy grail of distributed transactions is serializability—the guarantee that the outcome of concurrent transactions is equivalent to some sequential execution. Even stronger is external consistency (also called linearizability for operations), which ensures that if transaction T1 commits before T2 starts, then T2’s effects appear after T1’s.
Both Percolator and Spanner aim for serializability. Spanner goes further and achieves external consistency. But both pay a price.
To understand that price, we first need to examine the building blocks that these systems use: two‑phase commit, two‑phase locking, timestamp ordering, and snapshot isolation.
Two‑Phase Commit (2PC)
Two‑phase commit is the classic protocol for ensuring atomicity across multiple participants. It works in two phases:
- Prepare phase: The coordinator sends a “prepare” message to all participants. Each participant votes yes or no (it can vote no if it cannot guarantee that it can commit, e.g., due to a conflict or a transient failure). If any participant votes no, the coordinator aborts.
- Commit phase: If all votes are yes, the coordinator sends a “commit” message; if any no, it sends “abort.” Participants execute the commit or rollback.
The problem with 2PC is that it is blocking: if the coordinator fails after sending the prepare messages but before sending the commit/abort, the participants are stuck holding locks (or prepared state) until the coordinator recovers. This can hold up other transactions indefinitely. This is why many distributed systems either avoid 2PC or use it with a highly available coordinator (like Chubby or ZooKeeper).
Both Percolator and Spanner use a variant of two‑phase commit, but they handle the blocking problem in very different ways.
Two‑Phase Locking (2PL)
Two‑phase locking is the most common way to achieve serializability via locks. The rule is simple: all lock acquisitions must occur before any lock releases. Typically, this means that a transaction acquires all its locks during its execution (growing phase) and releases them only at commit or abort (shrinking phase). This ensures that no other transaction can see intermediate states.
The downside is that 2PL is prone to deadlocks and can severely limit concurrency. It also requires a central lock manager if data is partitioned across many machines—an unrealistic requirement at global scale.
Percolator uses a lock‑based approach but distributes the locks alongside the data rows. Spanner uses a conservative form of locking enhanced by timestamp ordering.
Timestamp Ordering and Multi‑Version Concurrency Control (MVCC)
An alternative to locking is to assign each transaction a timestamp and use that timestamp to determine serialization order. In MVCC, each read sees a consistent snapshot as of a particular timestamp. Writes are assigned later timestamps. If two transactions conflict, the one with the later timestamp may need to abort.
Timestamp ordering avoids many locking overheads but requires a global, monotonically increasing timestamp source—which is difficult to implement in a distributed system with clock skew.
Spanner solves this with TrueTime, a clock that exposes an uncertainty bound. Percolator, lacking such a clock, uses a different trick: it uses timestamps from the underlying Bigtable (which are loosely synchronized) and avoids clock dependency for ordering by using locks.
Now, with these concepts in mind, let’s get into the weeds.
Part 3: Percolator—The Clever Hack on Bigtable
Percolator was built at Google to replace the MapReduce‑based indexing pipeline that rebuilt the web search index every few days. The goal was to incrementally update the index—i.e., to apply small changes as web pages changed, rather than reprocessing the entire web. This required distributed ACID transactions over billions of rows scattered across Bigtable.
Bigtable was (and still is) a distributed key‑value store that provides eventual consistency across rows. Individual row transactions are atomic, but multi‑row transactions are not supported. Percolator layers transactions on top by using Bigtable as a storage substrate and adding a separate lock service (Chubby) for coordination.
The result is a system that works well for latency‑tolerant, batch‑oriented workloads—but is far from real‑time.
Architecture Overview
Percolator’s design is deceptively simple. Each Bigtable cell (column in a row) is extended with three hidden columns that implement the transaction protocol:
- Lock: Contains the lock metadata (transaction ID, timestamp, etc.) or is empty.
- Write: Contains the timestamp of the committed write.
- Data: Contains the actual value.
A transaction reads by first attempting to acquire a snapshot timestamp from an Oracle (a timestamp server) that provides strictly increasing timestamps. Then it reads the most recent committed version of the data that is before its snapshot timestamp. If a lock is encountered during read, the transaction may choose to resolve the lock (either by waiting, aborting the locking transaction if it is dead, or pushing its own timestamp forward).
Writes are performed in two phases, similar to two‑phase commit, but with a twist: the transaction’s state is stored directly in the data, not in a separate coordinator.
The Commit Protocol in Detail
Assume a transaction T wants to write to two rows, R1 and R2. The steps are:
Pre‑write (Prepare):
For each row that will be modified, T writes a lock (with its transaction ID) to the Lock column, and writes the tentative value to the Data column. The locks are written using conditional mutations: if a lock already exists, the pre‑write fails and T must abort.Primary lock:
One row (chosen as the “primary”) is treated specially. Its lock serves as the truth about whether T is committed or aborted. If the primary lock exists, the transaction is still in progress; if it is replaced with a write record, the transaction is committed. All other locks are secondary.Commit (phase 2):
T sends a commit request to the primary. The primary’s lock is replaced with a write record (commit timestamp from the Oracle). This is the atomic commit point. Once the primary is committed, the transaction is considered committed globally, even if secondary locks haven’t yet been cleaned up.Secondary commit:
T then asynchronously replaces the secondary locks with write records. This can be done lazily.Rollback:
If any pre‑write fails, or if the transaction times out, it can roll back by deleting its locks. However, if T crashes after writing some locks but before committing, a recovery process (a “cleaner”) must detect incomplete transactions and either commit or abort them based on the primary’s state.
Why This Works (and Where It Breaks)
The cleverness of Percolator is that it avoids a centralized transaction coordinator by making each transaction’s primary lock the authority. If the coordinator crashes, any participant can check the primary to decide the fate of the transaction. This is a form of distributed consensus without a separate consensus protocol—it piggybacks on Bigtable’s strong consistency within a single row.
But this elegance comes with steep realities:
- Latency: Each pre‑write and commit round‑trip involves multiple Bigtable RPCs (even with multiplexing). In typical deployments, a two‑row transaction can take 10–15 milliseconds even under light load. Under contention, conflicts cause aborts and retries that multiply the cost.
- Contention: The protocol is extremely sensitive to lock conflicts. If two transactions try to modify overlapping rows, one will block or abort. Since locks are stored in the row itself, reading a row while a lock exists forces the reader to wait or resolve.
- Lazy clean‑up: After a transaction commits, secondary locks may linger if the client crashes before cleaning them. Subsequent readers must resolve those locks—they become the “cleaner” on the fly. This adds latency to reads.
- Timestamp Oracle: The Oracle must be a strictly monotonic timestamp generator. It is a single point of failure (though replicated via Paxos). If the Oracle stalls, all transactions stall.
- Cascading rollbacks: If a transaction aborts early, any transaction that already read data that will be rolled back may need to be aborted as well. Percolator handles this via a “rollback” timestamp, but it adds complexity.
Real‑World Implications for Google Search
Percolator was used for Google’s incremental indexing pipeline. In that workload, the system processes batches of millions of updates per day, but each update is typically small (a few rows). Latency of tens of milliseconds per transaction is acceptable because indexing is an offline, background job. The system is also highly fault‑tolerant: if a transaction fails, it can be retried later.
However, Percolator would be a terrible choice for a real‑time payment system. The high latency and lock contention would make it impossible to sustain thousands of transactions per second across hot rows.
Code Example: Pseudo‑Percolator Commit
Let’s illustrate the commit protocol with a simplified Go‑like pseudocode:
type Transaction struct {
ID int64
Snapshot int64
Primary string
Writes []RowWrite
}
func (t *Transaction) Prepare(ctx Context) error {
// Phase 1: pre‑write all rows
for _, w := range t.Writes {
err := Bigtable.ConditionalMutation(w.Row, func(cell) {
if cell.Lock != nil {
return ErrLockConflict
}
cell.Lock = &Lock{TxnID: t.ID, Primary: t.Primary}
cell.Data = w.Value
})
if err != nil {
return err
}
}
return nil
}
func (t *Transaction) Commit(ctx Context) error {
// Phase 2a: commit primary
err := Bigtable.Write(t.Primary,
map[string]interface{}{
"Lock": nil, // remove lock
"Write": &Write{Timestamp: Oracle.Next()}, // commit timestamp
},
)
if err != nil {
return err
}
// Phase 2b: asynchronously commit secondaries
for _, w := range t.Writes {
if w.Row == t.Primary { continue }
go Bigtable.Write(w.Row, map[string]interface{}{
"Lock": nil,
"Write": &Write{Timestamp: commitTs},
})
}
return nil
}
This omits error handling, retries, and lock resolution during reads, but captures the essence.
Failure Modes
What happens if a transaction crashes after pre‑writing but before committing? A subsequent reader encountering the lock will check the primary:
- If the primary has a write record → the transaction is committed → the reader can clean the secondary lock by replicating the write record.
- If the primary still has a lock → the transaction is still in progress (or dead) → the reader waits or tries to abort the primary lock by checking a heartbeat timeout. If the lock is stale, the reader can “steal” the primary lock (write a rollback marker) and then abort the secondary locks.
This is a form of cooperative lock management. It works, but it means that every read might end up doing a write, increasing latency unpredictably.
Summary of Percolator
Percolator is a masterful hack that proved distributed transactions were possible on an eventually‑consistent store. It trades clean guarantees for pragmatic complexity. Its strengths are simplicity (no new storage engine needed) and reasonable performance for batch workloads with low contention. Its weaknesses are latency spikes due to lock resolution, sensitivity to hot spots, and dependence on a monotonic timestamp oracle.
It is the right tool when you need ACID across a few rows (tens to low hundreds) in a system that is already built on Bigtable or a similar key‑value store, and when millisecond‑scale latency is acceptable.
But if you need sub‑millisecond consistency across thousands of machines and millions of rows per second, Percolator will break. For that, you need Spanner.
Part 4: Spanner—The Reinvention of the Database
Spanner is Google’s globally distributed, synchronously replicated, externally consistent database. It was designed from the beginning to provide ACID transactions at planetary scale, with an emphasis on strong consistency rather than availability at all costs. The key innovation was TrueTime, a clock infrastructure that uses GPS and atomic clocks to bound clock uncertainty.
Spanner is not a hack on top of another system. It is a full‑fledged distributed database with its own storage engine (Colossus), its own replication protocol (Paxos), and a custom transaction protocol that combines two‑phase locking with timestamp‑based MVCC.
The ambition of Spanner is staggering: to make a database that behaves like a single‑machine database, even when its data is spread across the globe.
TrueTime: The Foundation
Clock skew is the enemy of distributed systems. Without a reliable way to order events, you cannot guarantee external consistency. Most systems settle for using loosely synchronized clocks (via NTP) and accept that ordering might be wrong. Spanner refuses to accept this.
TrueTime provides an API with a simple contract:
TT.now()returns an interval[earliest, latest]that is guaranteed to contain the absolute time at the moment of the call.TT.after(t)is true when the current time is definitely aftert.TT.before(t)is true when the current time is definitely beforet.
The uncertainty interval (the difference between earliest and latest) is typically 1–7 milliseconds depending on proximity to GPS receivers and the quality of atomic clocks. This interval is used by Spanner to assign timestamps to transactions in a way that guarantees serialization order even across widely separated machines.
If two transactions T1 at node A and T2 at node B both want to commit, they assign themselves commit timestamps from their local TrueTime intervals. Suppose T1’s interval is [t1_start, t1_end] and T2’s is [t2_start, t2_end]. Spanner can order them deterministically even if the intervals overlap, as long as the commit protocol ensures that no transaction’s timestamp is finalized before it is safely committed.
The core rule is: a transaction’s commit timestamp must be greater than or equal to the current time (i.e., after the uncertainty interval). In practice, Spanner uses a “wait” phase after Paxos commits to ensure that the timestamp is in the past.
Architecture Layers
Spanner organizes data into directories (sets of rows that share a common prefix), which are the unit of replication and movement. Each directory is replicated across a set of Paxos groups, typically spanning multiple zones or regions. The transaction protocol operates at the directory level: transactions that touch only one directory can be committed with a single Paxos round‑trip. Multi‑directory transactions (distributed transactions) require a coordinator‑driven two‑phase commit across Paxos groups.
Single‑Directory Transactions (Paxos)
For writes to a single directory, Spanner uses a distributed commit protocol that looks like a classic Paxos write:
- The client sends a write request to the leader of the Paxos group for that directory.
- The leader executes transaction logic, acquires locks (shared for reads, exclusive for writes), and then proposes the write to the Paxos group.
- Paxos collects a majority quorum and commits the value.
- The leader assigns a commit timestamp using TrueTime: it ensures the timestamp is later than the current time (waiting if necessary) and not overlapping with any ongoing transactions.
This yields commit latencies of a few milliseconds within a region (dominated by Paxos and the TrueTime wait), which is much lower than Percolator’s two‑phase commit latency for a single row.
Multi‑Directory Transactions (Two‑Phase Commit with Paxos)
When a transaction spans multiple directories, the protocol becomes more involved. Each directory has a leader. One directory is designated the coordinator. The steps are:
- Lock acquisition: The client executes reads and writes using two‑phase locking. Writes are buffered locally.
- Prepare phase: The client sends prepare messages to each participant leader. Each leader acquires locks and ensures it can commit (e.g., reads are stable, no conflicts). Then it prepares the value locally but does not commit. It returns a Promise containing a timestamp (its local TrueTime) that represents the earliest possible commit time for that participant.
- Commit phase: The coordinator collects all promises. It picks a commit timestamp that is:
- Greater than all participant promises.
- Greater than or equal to the coordinator’s own TrueTime
now().latest. - After waiting for the TrueTime uncertainty to ensure the timestamp is in the past (this is the crucial step).
- Coordinator commit: The coordinator commits the transaction by writing a Paxos record for its own directory with the chosen timestamp. This is the atomic commit point.
- Participant commit: The coordinator sends the final timestamp to all participants. Each participant commits its own directory using that timestamp (again via Paxos). If any participant fails, the transaction is aborted.
The key insight: the coordinator’s wait ensures that the chosen commit timestamp is guaranteed to be less than any future read that starts after the transaction commits (because reads use TT.now().earliest). This provides external consistency.
External Consistency Verification
Consider two transactions T1 and T2. T1 commits with timestamp ts1. T2 starts after T1 commits. For T2 to see T1’s effects, we need to guarantee that T2’s read timestamp is greater than ts1.
T2’s read timestamp is TT.now().earliest (the minimum possible current time). T1’s commit timestamp is at least TT.now().latest at the time of commit (plus some overhead). Because T1 waits for its interval to fully elapse before finalizing the commit timestamp, we have:
ts1 ≤ TT.now().earliest at the moment T1 commits? Actually, the rule is: T1’s commit timestamp is chosen as max(participant_promises, coordinator.now().latest) + 1. Then the coordinator waits until TT.after(ts1) is true (i.e., the current time is definitely after ts1). So by the time T1 returns, any subsequent transaction’s read timestamp (which is TT.now().earliest) will be greater than ts1. Thus, serial order is preserved.
Real‑World Engineering Challenges
Spanner’s TrueTime is not magic. It requires physical infrastructure: GPS receivers in every datacenter, atomic clocks on each machine, and careful engineering to bound clock drift. Google has documented that many early failures were due to clock hardware malfunctions (e.g., GPS signal loss, clock battery failure). The system must degrade gracefully when TrueTime uncertainty increases (e.g., if a datacenter loses GPS, the uncertainty intervals grow, which increases commit latency).
Another challenge is read‑only transactions. Spanner uses snapshot isolation for reads: a read‑only transaction gets a timestamp from TrueTime and then reads all data as of that timestamp. These reads are lock‑free (they do not block writes) and can be served by any replica (not just the leader). However, to guarantee that the read sees a consistent snapshot, the replica must be “safe” (i.e., it must have applied all writes up to that timestamp). This requires a special protocol called safe time—the replica periodically computes the latest timestamp for which it is up‑to‑date. If a read’s timestamp is beyond the replica’s safe time, the read must wait.
Safe time is maintained via Paxos state and local clock bounding. This adds complexity but is essential for fast global reads.
Code Illustration: TrueTime‑Aware Commit
Below is a simplified pseudocode for a Spanner‑style multi‑directory commit, focusing on the TrueTime wait:
type Transaction struct {
ID UUID
Writes []Mutation
Participants []PaxosGroup
Coordinator PaxosGroup
}
func (t *Transaction) Commit() error {
// Phase 1: Prepare all participants
promises := []Promise{}
for _, p := range t.Participants {
promise, err := p.Prepare(t.ID, t.WritesForGroup(p))
if err != nil {
return err
}
promises = append(promises, promise)
}
// Phase 2: Choose commit timestamp
nowInterval := TrueTime.Now()
commitTs := max(
max(promises.map(p=>p.Timestamp)),
nowInterval.Latest,
) + 1
// Wait until TrueTime guarantees commitTs is in the past
for {
if TrueTime.After(commitTs) {
break
}
sleep(smallDuration)
}
// Phase 3: Coordinator commit (atomic point)
err := t.Coordinator.Commit(t.ID, commitTs, t.Participants)
if err != nil {
return err
}
// Phase 4: Asynchronously notify participants
for _, p := range t.Participants {
go p.CommitAt(commitTs)
}
return nil
}
The TrueTime.After loop is the crucial lock‑in of external consistency. Without it, a transaction could commit with a timestamp that another transaction might see as still in the future due to clock skew.
Latency and Throughput Realities
The TrueTime wait adds a modest but fixed overhead: typically 1–7 milliseconds per distributed transaction. For single‑directory transactions, the wait is minimal (the leader waits only for its own uncertainty, which is smaller because it can abort if the clock uncertainty is high). For multi‑directory transactions, the wait is the sum of the coordinator’s uncertainty plus any participant uncertainty contributions.
In practice, Google reports commit latencies for a multi‑directory transaction within the US in the 10–50 millisecond range, while cross‑continent transactions (e.g., US to Europe) can take 100–200 ms due to network round‑trips plus Paxos and TrueTime.
Throughput is limited by the lock granularity and Paxos groups. Spanner partitions data into directories; a single directory is limited to a few tens of gigabytes and a single Paxos leader can handle thousands of writes per second. However, hot directories (e.g., a single account that is frequently updated) become bottlenecks. Spanner mitigates this with directory splitting and leader balancing, but it cannot eliminate contention entirely.
Summary of Spanner
Spanner achieves what many thought impossible: global external consistency at scale. It does so by throwing hardware at the problem (atomic clocks, GPS receivers) and engineering a meticulous protocol that uses time as an ordering primitive. The cost is in infrastructure, latency (the TrueTime wait), and complexity.
Spanner is the gold standard for applications that cannot tolerate any inconsistency: financial systems, inventory management, global user account balances, and any system where you need to be right, even if that means being slightly slower.
Part 5: Head‑to‑Head Comparison
Now that we understand both systems in depth, let’s compare them across several dimensions.
Consistency Model
| Feature | Percolator | Spanner |
|---|---|---|
| Isolation Level | Snapshot Isolation | Serializable (external consistency) |
| Read Locking | Locks encountered during reads | Lock‑free snapshot reads |
| Write Locking | Optimistic? (Locks on rows) | Pessimistic (2PL) |
| Global ordering | Via Oracle timestamp | Via TrueTime + commit wait |
Percolator provides snapshot isolation, which is weaker than serializability. Conflicting writes can lead to a serialization anomaly called “write skew.” Spanner provides full serializability with external consistency, the strongest level.
Latency
| Operation | Percolator | Spanner |
|---|---|---|
| Single‑row read | ~1–2 ms (scan locks) | ~1 ms (replica read) |
| Two‑row transaction | 10–20 ms | 3–10 ms (single directory) |
| Multi‑directory TX | Not applicable (all rows in Bigtable key space) | 10–100 ms (depends on distance) |
Percolator’s latency is dominated by lock resolution and the two‑phase commit round‑trips. Spanner’s latency is dominated by Paxos quorum and the TrueTime wait. Spanner is generally faster for simple operations within a region, but can be slower for wide‑area distributed transactions.
Throughput and Contention
- Percolator: Very sensitive to lock contention. Hot rows become a bottleneck. Because locks are stored in the row, any concurrent contention forces a retry or wait. The system is designed for low contention batch workloads.
- Spanner: Uses fine‑grained locks and directory‑level partitioning. For a hot row, Spanner uses optimistic concurrency on reads and pessimistic for writes. Contention still exists, but Spanner can handle higher rates because locks are managed in‑memory by the Paxos leader and are not stored in the data (except for lock tables). However, a single directory cannot exceed the throughput of a single Paxos leader.
Availability and Fault Tolerance
- Percolator: The Oracle is a single point of failure if unreplicated, but it is replicated via Paxos in practice. The system can survive individual Bigtable server failures because Bigtable itself is fault‑tolerant. However, Percolator’s lazy lock resolution means that reads may hang while a lock is being resolved.
- Spanner: Every directory is replicated via Paxos, so it can survive minority failures. TrueTime requires at least one GPS receiver and functioning atomic clocks per datacenter; loss of GPS increases uncertainty but does not crash the system. The real risk is a widespread power outage that could disrupt TrueTime across multiple zones.
Operational Complexity
Percolator is simpler to implement if you already have Bigtable (or a similar key‑value store). It adds only the client‑side logic and a small oracle. Spanner requires an entirely new storage engine, replica management, clock infrastructure, and sophisticated transaction coordinator. It is a massive system to build and operate.
Cost
Percolator runs on commodity hardware with no special clock requirements. Spanner requires GPS receivers, atomic clocks, and the associated maintenance. Additionally, Spanner’s tightly synchronous replication across multiple datacenters consumes more network bandwidth and storage (multiple replicas). The total cost of ownership is much higher.
Part 6: When to Choose Which
Given the differences, the choice between Percolator and Spanner—or between systems inspired by them (CockroachDB, YugabyteDB, TiDB, FoundationDB, etc.)—depends on your workload’s requirements.
Choose Percolator‑Style when:
- Your workload is batch‑oriented or latency‑tolerant (hundreds of milliseconds or seconds is acceptable).
- You have low contention on rows (e.g., each transaction touches a small set of rows that are not frequently accessed by others).
- You are already using a key‑value store like Bigtable, HBase, or TiKV and want to add transactions without changing your storage layer.
- You need strong consistency but can tolerate snapshot isolation (i.e., you don’t need to prevent all serialization anomalies).
- You operate on a single datacenter or a single region (Percolator can be stretched across regions, but latency would be very high).
Choose Spanner‑Style when:
- You need external consistency (e.g., financial transactions, auction systems, reservation systems).
- Your data is globally distributed and you need reads and writes to be consistent across continents.
- You can tolerate slightly higher latency (tens to hundreds of milliseconds) for the sake of consistency.
- Your data has a natural partitioning (e.g., user accounts can be grouped into directories) that minimizes multi‑directory transactions.
- You have the resources to invest in infrastructure (either Google Cloud’s Spanner or a compatible implementation like CockroachDB).
The Middle Ground: CockroachDB and YugabyteDB
Many open‑source databases have attempted to replicate Spanner’s external consistency without the custom hardware. For example, CockroachDB uses a hybrid logical clock (HLC) that combines physical time with logical counters to bound uncertainty, avoiding the need for GPS receivers. This reduces TrueTime’s guarantee: CockroachDB provides “global snapshot isolation” with linearizability for single‑key operations, but not full external consistency for multi‑key transactions without additional blocking. The trade‑off is lower latency (no TrueTime wait) and lower cost.
YugabyteDB uses a similar approach with its own hybrid clock and supports both snapshot isolation and serializability. The open‑source world has largely embraced Spanner’s design principles without the hardware dependency, at the cost of slightly weaker consistency guarantees in practice (especially around clock skew).
If you are building a system today, you will likely not choose raw Percolator or raw Spanner. Instead, you will choose a database that implements similar ideas: TiDB (Percolator‑inspired), CockroachDB or YugabyteDB (Spanner‑inspired), or Google Cloud Spanner itself.
Part 7: Lessons for Distributed Systems Engineers
The journey from the lie of the monolithic database to the realities of Percolator and Spanner teaches several enduring lessons.
1. There is no magic bullet.
Distributed transactions always cost something. Percolator pays with latency and contention. Spanner pays with hardware and TrueTime wait. Every year, new papers claim to “solve” distributed transactions with clever protocols, but the fundamental constraints remain: you cannot achieve both high throughput, low latency, and strong consistency across a wide area without some form of synchronization that involves waiting.
2. Clock skew is the biggest hurdle.
Without a reliable clock, you cannot order events. Percolator sidesteps this by using locks to order transactions (the lock serves as a logical point of serialization). Spanner confronts it head‑on with TrueTime. Most systems in the wild use a hybrid approach, but the fact remains that clock skew is the primary reason why distributed transactions are slow.
3. Replication and transactions must be co‑designed.
Percolator’s transactions are independent of Bigtable’s replication (Bigtable uses a completely separate replication mechanism). This leads to a mismatch: a transaction might be committed on the primary but not yet replicated, and a subsequent read might see stale data from a replica. Percolator handles this by forcing all reads to go to a single region (or by using timestamp‑based reads that ignore replicas). Spanner integrates replication (Paxos) and transaction commit into the same protocol, ensuring that a transaction is not considered committed until it is durably replicated.
4. Locks are both a blessing and a curse.
Locks provide ordering but create contention and blocking. Percolator’s lock‑per‑row approach makes reads block on pending writes. Spanner’s lock table is managed in‑memory at the Paxos leader but still causes blocking for conflicting writes. The only way to avoid blocking entirely is to use optimistic concurrency control (OCC), but OCC requires retries and can degrade under high contention. The choice of locking vs. OCC is a design decision that affects performance dramatically.
5. Failure handling is the hardest part.
The papers describe the happy path. The reality is that production systems spend most of their code handling failures: network timeouts, crashed coordinators, split‑brain due to clock skew, deadlock detection, lock stealing, etc. Percolator’s lock resolution protocol is complex and error‑prone. Spanner’s use of Paxos means that leaders can fail and a new leader must recover the state without losing in‑flight transactions. Both systems have sophisticated recovery mechanisms that are rarely described in detail.
Part 8: The Future
What comes after Spanner? The search for a distributed transaction protocol that is both fast and strongly consistent continues. Some promising directions include:
- Deterministic databases (e.g., Calvin, FaunaDB): Pre‑plan the entire execution order of transactions before they start, eliminating the need for runtime locking. This works well for workloads where transaction patterns are known ahead of time.
- Calvin is a deterministic database that batches transactions into “epochs” and processes them in a fixed order. It achieves high throughput but requires all input to be known before execution begins.
- Gossip‑based protocols: Some research explores using epidemic broadcast to disseminate transaction commit information faster than Paxos, at the cost of weaker consistency guarantees.
- Hybrid logical clocks (HLCs): As used by CockroachDB, HLCs combine physical and logical time to bound uncertainty without hardware. They are not perfect but offer a pragmatic trade‑off for many applications.
Perhaps the ultimate lesson is that the lie of a single, consistent database is not something we can fully capture in a distributed world without paying a heavy price. As engineers, we must decide which parts of the lie we need to preserve and which we can afford to discard.
For most applications, eventual consistency is good enough. For a few critical ones, Percolator’s snapshot isolation or Spanner’s external consistency are the right tools. But for all of them, we must accept that the fortress has fallen, and we now live in the city.
Conclusion
Distributed transactions are not a solved problem. They are a continuous negotiation between our desire for simplicity and the messy realities of physics, network faults, and concurrent access. Percolator and Spanner represent two ends of a spectrum: one is a pragmatic hack that layers ACID on an eventually consistent store; the other is a ground‑up reinvention of the database that uses novel clock infrastructure to achieve the strongest possible consistency.
Neither is perfect. Percolator is slow under contention and requires careful monitoring of lock resolution. Spanner is expensive, operationally heavy, and introduces latency via TrueTime waiting. Yet both have been used in production at Google for years, processing billions of transactions and serving billions of users.
As you design your own distributed systems, remember the lessons from these two giants: embrace the lie gently, design for failure, measure clock skew, and always ask what you are willing to pay for consistency. The answer will guide you to the right tool—Percolator, Spanner, or something in between.
Now go forth and build systems that respect the reality of distribution. And when someone asks you for a “truly global ACID database,” smile knowingly, hand them this blog post, and say: “Yes, but here’s what it costs.”
This post was originally written as an extended exploration of distributed transaction realities. The code examples are simplified illustrations; production implementations are vastly more complex. For further reading, see the original Percolator and Spanner papers, as well as the open‑source databases inspired by them.