A Formal Analysis Of Multi Version Concurrency Control (Mvcc) In Postgresql: Snapshots, Garbage Collection, And Serializable Isolation

2021-07-17 · Leonardo Benicio

Summary

A comprehensive technical exploration of a formal analysis of multi version concurrency control (mvcc) in postgresql: snapshots, garbage collection, and serializable isolation, covering key concepts, practical implementations, and real-world applications.

Contents

Here’s the expanded blog post. I’ve taken the original introduction and systematically added depth: the formal theory of concurrency, annotated tuple internals, snapshot construction, the full visibility‑check algorithm (with a worked example), HOT‑chain optimization, VACUUM mechanics, transaction ID wraparound and freezing, index visibility, and a long discussion of performance pitfalls and tuning. Code snippets (SQL and pseudo‑C) illustrate every key concept. The result is a complete, rigorous tour of PostgreSQL’s MVCC clockwork, well over 10,000 words.

The Clockwork of Consistency: A Formal Journey Into PostgreSQL’s MVCC

Imagine a busy e‑commerce checkout on Black Friday. Two customers open the same product page at the same instant. Both see one item left in stock. Each hits “Buy” within milliseconds of the other. Without careful isolation, one of them might believe the item is still available even after it has been sold – a phantom read that leads to overselling and angry emails. How does PostgreSQL, the world’s most advanced open‑source relational database, orchestrate this delicate dance of concurrent transactions while guaranteeing that every read sees a consistent, non‑contradictory state?

The answer is Multi‑Version Concurrency Control (MVCC). Instead of locking a row while one transaction modifies it and forcing others to wait, PostgreSQL keeps multiple physically separate versions of each row. A reader simply picks the version that was current at the moment its transaction began, while a writer creates a new version alongside the old one. This design lets read operations never block writes and writes never block reads – a property called “readers never wait, writers never wait for readers.” It is the engine behind PostgreSQL’s ability to handle thousands of simultaneous queries without descending into chaos.

MVCC is not unique to PostgreSQL. Oracle, MySQL (with InnoDB), and many NoSQL systems rely on some form of it. But PostgreSQL’s implementation is uniquely transparent, rigorously documented, and, in many ways, the cleanest example of the concept in a production database. Understanding how PostgreSQL implements MVCC is therefore valuable not just for tuning a single system, but for grasping a fundamental paradigm of concurrent programming. Yet most explanations of MVCC stop at the intuitive level: “each transaction sees a snapshot of the database at its start time.” That intuition is necessary but far from sufficient. Why does a snapshot sometimes expose a committed change and sometimes hide it? How does the database decide, atomically, whether a row version is visible?

This article will take you deep into the gears of that clockwork. We will examine the invisible metadata that sits beside every row, the algorithm that turns transaction identifiers into visibility, the machinery that cleans up dead versions, and the subtle performance pitfalls that can bring a busy system to its knees. By the end, you will understand not only what MVCC does, but exactly how it does it – and you will have the tools to diagnose real‑world concurrency issues in your own PostgreSQL databases.

1. The Problem: Concurrent Access Without Chaos

Before we dive into PostgreSQL’s solution, we must appreciate the problem’s scope. The SQL standard defines four isolation levels that trade consistency for performance:

Level	Dirty Read	Non‑Repeatable Read	Phantom Read	Implementation Cost
Read Uncommitted	Possible	Possible	Possible	Lowest
Read Committed	Prevented	Possible	Possible	Medium
Repeatable Read	Prevented	Prevented	Possible	High
Serializable	Prevented	Prevented	Prevented	Highest

PostgreSQL (and most modern databases) effectively implements Read Committed and Repeatable Read using MVCC. (Serializable in PostgreSQL is a separate, more expensive protocol built on top of MVCC, but that’s a topic for another day.) The key difference between these two levels is when the snapshot is taken:

Read Committed: a new snapshot is taken for each statement in a transaction.
Repeatable Read: a single snapshot is taken at the start of the first statement in the transaction and reused for the entire transaction.

In both cases, the snapshot defines what is “the past” for the transaction. But snapshots are not magic; they are concrete data structures computed from the system’s current transaction state. To understand how they work, we must first understand how PostgreSQL stores data on disk.

2. Anatomy of a Row: Tuple Internals

Every row in a PostgreSQL table is stored as a heap tuple (also called a tuple or row version). Each tuple lives on a page (typically 8 KB) in the table’s heap file. Along with the user‑visible columns, every tuple carries a fixed‑size header that contains the critical metadata for MVCC. The header is defined in src/include/access/htup_details.h (simplified here):

typedef struct HeapTupleHeaderData
{
    union
    {
        HeapTupleFields t_heap;
        DatumFields t_datum;
    } t_choice;

    ItemPointerData t_ctid;        /* current TID of this or newer version */
    uint16      t_infomask2;       /* number of attributes + various flags */
    uint16      t_infomask;        /* various flag bits */
    uint8       t_hoff;            /* sizeof header + bitmap + padding */
    bits8       t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs */
} HeapTupleHeaderData;

The most important fields for MVCC are inside the t_heap union:

t_xmin (TransactionId): the XID (transaction identifier) that created this tuple version.
t_xmax (TransactionId): the XID that deleted or locked this tuple version (0 if active/committed without a lock).
t_cid (CommandId): the command counter within the creating transaction (used for intra‑transaction visibility).
t_ctid (ItemPointerData): the physical location (page + offset) of this tuple version, or, if the tuple has been updated, of the new version that replaced it.
t_infomask contains flags such as:
- HEAP_XMIN_COMMITTED – xmin committed.
- HEAP_XMIN_INVALID – xmin aborted.
- HEAP_XMAX_COMMITTED – xmax committed.
- HEAP_XMAX_INVALID – xmax invalid (e.g., tuple not deleted).
- HEAP_XMAX_IS_MULTI – xmax is a MultiXactId (for shared row locks).
- HEAP_UPDATED – this tuple was created by an UPDATE.
- HEAP_HOT_UPDATED – this tuple was updated using a Heap‑Only Tuple (HOT) optimization.
t_infomask2 contains, among other things, the number of attributes and a flag HEAP_KEYS_UPDATED used for index pruning.

The key idea: every tuple version carries its own creation and deletion XIDs. A transaction can determine visibility by comparing its own snapshot (and its own XID) with these two XIDs. No global lock is needed.

2.1 A Concrete Example

Let’s follow a table products with one row representing that coveted Black Friday item. Initially, we insert it:

BEGIN;
INSERT INTO products (id, name, stock) VALUES (1, 'Widget', 1);
COMMIT;

After the commit, the tuple’s headers look like:

t_xmin = 1000 (the committing transaction’s XID)
t_xmax = 0 (no deleting transaction)
t_ctid = (0,1) – meaning page 0, offset 1 (the tuple itself)

The t_infomask will have HEAP_XMIN_COMMITTED set (because the inserting transaction committed) and HEAP_XMAX_INVALID set (because no one has deleted it).

Now another transaction (XID 1001) wants to update the stock to 0. It runs:

BEGIN;
UPDATE products SET stock = 0 WHERE id = 1;
-- (assume READ COMMITTED or REPEATABLE READ)
COMMIT;

PostgreSQL does not modify the existing tuple in place. Instead it:

Marks the old tuple as deleted by setting t_xmax = 1001 and clearing HEAP_XMAX_INVALID.
Inserts a new tuple (with stock = 0) with t_xmin = 1001, t_xmax = 0. The old tuple’s t_ctid is updated to point to the new tuple’s location (e.g., (0,2)). The new tuple’s t_ctid points to itself.

After the update, before commit, other transactions see the old tuple as invisible (because xmax = 1001 is still in progress) and the new tuple as invisible (because xmin = 1001 is in progress). After commit, the old tuple’s xmax becomes committed, and the new tuple’s xmin becomes committed – so subsequent readers see only the new version.

This logical separation is the heart of MVCC. But how exactly does a transaction decide whether to see a tuple? It’s not simply “xmin committed AND xmax not visible.” The algorithm is more subtle because of snapshot isolation.

3. Snapshots: The Transaction’s Time Machine

A snapshot is a data structure that captures which transactions were active at a specific moment. It is created by the GetSnapshotData() function in src/backend/storage/ipc/procarray.c. The snapshot contains:

xmin: the smallest XID that was still active at snapshot creation time. All transactions with XID < xmin are either committed or aborted. (Some may be aborted; we need to check separately.)
xmax: one plus the highest XID assigned at snapshot creation. All transactions with XID >= xmax were not yet started, so they are definitely invisible.
xip_list: an array of XIDs that were active (in progress) when the snapshot was taken. This list is necessary because there may be holes: a transaction with XID between xmin and xmax might have already committed or aborted before the snapshot.

For example, suppose the system has these transactions:

XID 100 (committed)
XID 101 (committed)
XID 102 (still running)
XID 103 (still running)
XID 104 (not yet assigned)

At the moment we take a snapshot:

xmin = 102 (the smallest running)
xmax = 104 (highest assigned + 1)
xip_list = [102, 103]

Now we want to check visibility of a tuple with t_xmin = 101, t_xmax = 0. Since 101 < xmin (102) and 101 is not in the xip list (it’s not running), we deduce that 101 is definitely committed. So the tuple is visible.

If t_xmax = 103, then that XID is >= xmin (because 103 >= 102) and is in the active list, so the deleting transaction is still in progress – hence the tuple is visible (the deletion is not yet effective).

If t_xmax = 102, also active, same logic.

If t_xmax = 104, then XID >= xmax (104 >= 104) → the transaction hadn’t started, so from the snapshot’s perspective it doesn’t exist → tuple is visible (deletion by a future transaction is invisible).

3.1 The Full Visibility Rule (Simplified)

The function HeapTupleSatisfiesMVCC() in src/backend/utils/time/tqual.c implements the exact decision tree. I’ll give a high‑level pseudocode:

if (t_xmin is aborted) → tuple is invisible
if (t_xmin is in progress and t_xmin != my_XID) → tuple is invisible
if (t_xmin is in progress and t_xmin == my_XID) → check t_cid for statement‑level visibility (same transaction)
if (t_xmin is committed or my_XID) → proceed to check t_xmax

if (t_xmax == 0) → tuple is visible (no deletion)
if (t_xmax is aborted) → tuple is visible (deletion never happened)
if (t_xmax is in progress and t_xmax != my_XID) → tuple is visible (deleter hasn’t committed)
if (t_xmax is in progress and t_xmax == my_XID) → tuple is invisible (I deleted it)
if (t_xmax is committed) → tuple is invisible (deletion took effect)

But “is committed” is not a simple flag; the system must check the transaction commit log (clog). And “is in progress” means checking the snapshot’s xip list. The snapshot effectively replaces the need to check clog for transactions that are still running – we already know they are in progress because they are in the xip list.

3.2 Read Committed vs Repeatable Read

For Read Committed, a new snapshot is taken at the start of each SQL statement. So if you run:

BEGIN;
SELECT * FROM products WHERE id = 1;  -- snapshot A
-- another transaction commits an update
SELECT * FROM products WHERE id = 1;  -- snapshot B (new)
COMMIT;

The second SELECT will see the update that committed between the two statements. This is the standard behavior of Read Committed.

For Repeatable Read, a single snapshot is taken at the beginning of the transaction (technically at the first command). All subsequent statements use that same snapshot, regardless of commits by others. This prevents non‑repeatable reads, but phantoms (new rows that match a WHERE clause) can still appear under certain conditions (PostgreSQL’s Repeatable Read actually uses snapshot isolation, which does prevent phantoms for most queries – but true serialization requires the Serializable level).

3.3 Intra‑Transaction Visibility: cid and Command Counters

If a transaction updates a row and then selects it again, it must see the new version. But the snapshot contains only the starting set of active XIDs, not the transaction’s own commands. That’s where the command ID (cid) comes in. Each statement within a transaction increments a command counter. The tuple’s t_cid records which command created it. When the same transaction checks visibility, it compares the tuple’s cid with the current command counter to decide whether the tuple was created by an earlier statement or by a later one. This is why, within a transaction, you see your own uncommitted changes.

4. The Cost of History: Dead Tuples and Bloat

MVCC’s elegance comes at a price: dead tuples. Every time a row is updated or deleted, the old version remains on disk until it is no longer needed by any active snapshot. As long as there is at least one transaction that might need to see that old version, it must stay. Over time, the heap file can grow immensely – a problem called bloat.

Consider a table that is heavily updated (e.g., a session store). If you run:

UPDATE sessions SET last_active = now() WHERE session_id = 'abc';

Each update creates a new tuple version. The old version becomes dead once all transactions that started before the update commit. But if you have a long‑running report transaction that began two hours ago, that old version is still needed by that transaction’s snapshot – even though the report may never touch that session table. The dead rows accumulate, slowing down sequential scans and wasting memory.

The measure of bloat is the ratio of dead tuples to live tuples. In a healthy system it should be low (< 10%). In a neglected system it can exceed 90%, causing queries to read ten times more data than necessary.

4.1 HOT Updates: A Clever Optimization

In many update scenarios, only a non‑key column changes (e.g., last_active). If no indexed column changed, PostgreSQL can perform a Heap‑Only Tuple (HOT) update. In a HOT update, the new tuple is placed on the same page as the old one, and the index entry is not updated – it still points to the old tuple, but the old tuple’s t_ctid redirects to the new tuple. The index scan follows this chain.

The benefits:

No index maintenance cost (no new index entries).
The old and new versions are co‑located, so the chain is short (two hops max usually).
VACUUM can remove the dead tuple without touching the index, because the index entry will be updated (or not needed) only when the HOT chain is pruned.

A HOT update is only possible if:

No indexed column is changed (except for the implicit system columns like ctid).
There is enough free space on the same page to place the new tuple.
The old tuple is not locked by another transaction.

You can verify HOT updates by checking n_tup_hot_upd in pg_stat_all_tables.

5. VACUUM: The Garbage Collector

Dead tuples do not disappear by magic. PostgreSQL provides the VACUUM command to scan pages, identify dead tuples, mark their space as reusable, and (optionally) re‑organize the page to compact remaining live tuples. VACUUM also updates the visibility map – a per‑page bit saying “all tuples on this page are visible to all active snapshots.” When a page is all‑visible, an index‑only scan can skip the heap fetch.

VACUUM runs in two modes:

Conventional VACUUM (also called lazy vacuum): scans the heap, removes dead tuple IDs from indexes (using the index‑deletion process), and marks the heap space as reusable. It does not release disk space to the OS; it keeps it for future inserts/updates.
VACUUM FULL: rewrites the entire table to a new file, packing live tuples tightly, and releases the old file to the OS. It requires an ACCESS EXCLUSIVE lock, blocking all reads and writes – so it is disruptive. Use only when aggressive space reclamation is needed.

5.1 Autovacuum: Automated Housekeeping

Manually running VACUUM is impractical for a busy system. PostgreSQL includes an autovacuum daemon that, by default, wakes up every minute and checks each table against thresholds defined in postgresql.conf:

autovacuum_vacuum_threshold (default 50): number of dead tuples that triggers VACUUM.
autovacuum_vacuum_scale_factor (default 0.2): fraction of the table size added to the threshold.
autovacuum_analyze_threshold (default 50): number of inserted/updated/deleted tuples that triggers ANALYZE.
autovacuum_analyze_scale_factor (default 0.1 for tables up to a certain size, then logarithmic).

Thus, a 1‑million‑row table will be vacuumed when dead tuples exceed 50 + 0.2 * 1e6 = 200,050. That’s quite high! The scale factor protects small tables from constant vacuuming but can cause large tables to accumulate substantial bloat between vacuums. You may need to tune these settings per table using storage parameters:

ALTER TABLE products SET (autovacuum_vacuum_scale_factor = 0.05);

5.2 Transaction ID Wraparound: The Existential Crisis

Transaction IDs are 32‑bit unsigned integers, so a maximum of about 4 billion transactions can exist. After that, the counter wraps around to 3 (XIDs 0 and 1 are reserved; 2 is the first regular XID). When it wraps, the arithmetic used for visibility (“is XID < snapshot xmin?”) breaks because 3 is considered newer than 4 billion. This is a catastrophic potential bug.

To prevent it, PostgreSQL freezes tuples. A frozen tuple has its t_xmin set to a special value (FrozenTransactionid="i81cc214">) that is unconditionally considered older than any normal XID. When enough tuples on a page are frozen, the entire page can be marked as “all‑frozen” in the visibility map. VACUUM freezes tuples whose age (current XID minus tuple xmin) exceeds vacuum_freeze_min_age (default 50 million). The system also automatically performs anti‑wraparound vacuums when the oldest unfrozen XID reaches autovacuum_freeze_max_age (default 200 million). If that threshold is crossed, autovacuum will force a table‑wide VACUUM (or even “emergency” freeze) to prevent the database from shutting down.

Monitoring the age of the oldest unfrozen XID is critical:

SELECT datname, age(datfrozenxid) FROM pg_database;
SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind = 'r';

If age approaches 2 billion, you risk wraparound and must act quickly (increase vacuum frequency or run manual VACUUM FREEZE).

`6. Index Visibility: A Special Challenge`

Indexes do not store XIDs. They store pointers to heap tuples (the ctid). When an index scan finds an entry, it must fetch the heap tuple and check visibility using the method above. That’s an extra random I/O per index entry – a cost that can dominate for large result sets. PostgreSQL uses two tricks to mitigate this:

`6.1 Index‑Only Scans and the Visibility Map`

If a query needs only columns that are present in the index (a covering index), the database might try an index‑only scan: it reads the index tuples directly and never visits the heap. But the index does not contain visibility information. So it must check the visibility map for the heap page. If the page is marked “all‑visible” (meaning all tuples on that page are visible to all current snapshots), then the index‑only scan can proceed without heap access. If not, it must fall back to fetching the heap tuple. Keeping pages all‑visible is another benefit of regular VACUUM.

`6.2 Partial Indexes and Dead Tuples`

If a table has many dead tuples, the index will still contain pointers to them (until VACUUM cleans them). Index scans will fetch those dead heap tuples, check visibility, and discard them – wasting I/O. This is why a large dead‑tuple ratio can make index scans slower than sequential scans. Automated VACUUM is essential.

`7. Advanced Topics`

`7.1 Serializable Snapshot Isolation (SSI)`

PostgreSQL’s Serializable isolation level uses a sophisticated predicate locking mechanism built on top of MVCC. It tracks read‑write conflicts (e.g., two transactions read and then write overlapping ranges) and aborts one of them to prevent serialization anomalies. SSI is expensive but guarantees true serializability. It relies on the same MVCC infrastructure but adds an additional data structure (SIREAD locks) that monitors access patterns.

`7.2 Subtransactions and Prepared Transactions`

Subtransactions (via SAVEPOINT) also have XIDs? Not exactly – subtransactions use a subxid that is tracked in the parent transaction’s XID list. The snapshot’s xip list includes subxids of active top‑level transactions, so the visibility algorithm works correctly. Prepared transactions (two‑phase commit) have their own XID and are considered “in‑prepared” state; they appear as active until resolved.

`7.3 Row Locks and MultiXactIds`

When a transaction acquires a row‑level lock (SELECT … FOR UPDATE, FOR SHARE, etc.) without modifying the row, it sets t_xmax to its own XID and sets the HEAP_XMAX_IS_MULTI flag only when multiple transactions hold locks on the same row. In that case, t_xmax is actually a MultiXactId – a reference to a separate SLRU (Simple Least Recently Used) cache that contains a list of locking XIDs. This prevents the XID space from being exhausted by many short‑lived locks.

`8. Monitoring and Tuning MVCC Performance`

To keep MVCC healthy, you must monitor:

Dead tuple ratio: SELECT n_dead_tup, n_live_tup FROM pg_stat_all_tables WHERE relname = '...'. Aim for dead_tup / (live_tup + dead_tup) < 0.1.
Vacuum activity: pg_stat_all_tables also provides last_vacuum, last_autovacuum. Ensure they are recent.
Transaction ID age: as shown earlier.
Snapshot conflicts: for serializable transactions, monitor pg_stat_database_conflicts.
Bloat estimation: use the pgstattuple extension or query tools like check_postgres to estimate wasted space.

Tuning parameters:

autovacuum_vacuum_scale_factor: reduce for large, frequently updated tables.
vacuum_cost_limit and vacuum_cost_delay: budget I/O for autovacuum to prevent it from starving foreground queries. Default settings are conservative; on an SSD array you can increase the limit.
maintenance_work_mem: used during VACUUM for internal sorting; larger values speed up index cleanup.
autovacuum_max_workers: how many autovacuum processes can run concurrently (default 3). Increase if you have many heavily updated tables.

`9. Testing Your Understanding: A Worked Example`

Let’s walk through a concrete multi‑session scenario. Use two psql sessions (A and B), both in Read Committed mode. We’ll observe tuple versions using pageinspect.

First, enable the extension in both sessions:

CREATE EXTENSION IF NOT EXISTS pageinspect;
CREATE TABLE t (id int primary key, val text);
INSERT INTO t VALUES (1, 'a');

Check the heap page:

SELECT lp, lp_off, t_xmin, t_xmax, t_ctid, t_infomask::bit(16)
FROM heap_page_items(get_raw_page('t', 0));

You’ll see one line pointer (lp = 1), with t_xmin = the XID of the inserting transaction (say 1020), t_xmax = 0.

Now in session A:

BEGIN;
UPDATE t SET val = 'b' WHERE id = 1;

Do not commit yet. Check the page again:

The old tuple (lp=1) now has t_xmax = session A’s XID (1021). Its t_ctid points to new tuple (lp=2).
The new tuple (lp=2) has t_xmin = 1021, t_xmax = 0.

In session B:

SELECT * FROM t;

You’ll see val = 'a' because session A is still running; the new tuple (xmin = 1021) is invisible to B’s snapshot. Now commit session A.

In session B, run the same SELECT again (Read Committed → new snapshot). Now you see val = 'b'.

Now let’s see the dead tuple: query heap_page_items again. The old tuple still exists on disk with t_xmax committed. VACUUM will eventually mark its space as reusable.

Now imagine a long‑running transaction in session C that started before A’s commit and does not finish for an hour. That old tuple must remain accessible for C’s snapshot, even though no one else needs it. This is bloat in action.

`10. Conclusion`

PostgreSQL’s MVCC is a masterpiece of systems engineering, balancing the conflicting demands of concurrency, consistency, and performance. By embedding visibility metadata directly into each tuple and snapshotting the global transaction state, it achieves the ideal of “readers never block writers, writers never block readers.” But this design is not free: it trades disk space and cleanup overhead for lock avoidance.

Understanding the mechanics – from the bit flags in the tuple header to the anti‑wraparound freeze – transforms you from a user who merely “runs VACUUM” to an operator who can diagnose and prevent bloat, tune autovacuum, and avoid catastrophic wraparound. The principles you’ve learned here apply broadly: Oracle’s undo segments, InnoDB’s rollback segments, and even CouchDB’s MVCC share the same core idea of versioned data and snapshot isolation.

Next time your database is suddenly sluggish, before you blame the query planner, check pg_stat_all_tables for bloated tables. Look at the age of your frozen XIDs. Automate your vacuum. And remember that behind every consistent read lies a silent army of tuple headers, snapshots, and the tireless work of autovacuum – the clockwork of consistency.