Write-Ahead Logging: The Unsung Hero of Database Durability

Dive deep into write-ahead logging (WAL), the technique that lets databases promise durability without sacrificing performance. Learn how WAL works, why it matters, and how modern systems push its limits.
Every database makes a promise: your committed data will survive crashes, power failures, and hardware hiccups. This promise—durability—seems simple until you consider the physics. Disks are slow. Memory is volatile. Writes can be reordered. Yet somehow, databases deliver both durability and performance. The secret weapon? Write-ahead logging.
1. The Durability Problem
Consider what happens when you execute UPDATE accounts SET balance = balance - 100 WHERE id = 1. The database must:
1. Find the relevant page in memory (or load it from disk)
2. Modify the balance value
3. Eventually write the page back to disk
The problem: step 3 is expensive. A single 4KB page write to an SSD takes 50-100 microseconds. To a spinning disk, it’s 5-10 milliseconds. If we waited for every modification to reach disk before acknowledging the transaction, throughput would collapse.
But what if the system crashes between step 2 and step 3? The modification exists only in volatile memory—it’s lost forever. Worse, what if the crash happens during step 3, leaving the page partially written and corrupted?
Databases need a way to guarantee durability without waiting for every page write. Write-ahead logging provides exactly that.
2. The Write-Ahead Logging Protocol
Write-ahead logging (WAL) rests on a deceptively simple principle: before modifying any data page, first write a log record describing the change to a sequential log file, and ensure that log record reaches stable storage.
The protocol has three rules:
Log before data: A log record describing a modification must be written to stable storage before the modified data page is written.
Commit record: When a transaction commits, a commit log record must be written to stable storage before the commit is acknowledged to the client.
Sequential writes: Log records are appended sequentially, never modified in place.
Why does this work? Sequential writes are fast—SSDs and HDDs both optimize for sequential access. A commit requires only one sequential write (the commit record), not random writes to potentially dozens of modified pages. The log captures enough information to reconstruct any changes, so even if data pages are lost, recovery can replay the log.
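To make the ordering concrete, here is a minimal sketch of the commit path under these rules, written in C against the POSIX file API; the function and record layout are illustrative assumptions, not any particular engine's implementation:

/* Commit path sketch: the commit record must be durable before the commit
   is acknowledged; dirty data pages are written back lazily, later. */
#include <unistd.h>
#include <stddef.h>

int commit_transaction(int log_fd, const void *commit_record, size_t len) {
    if (write(log_fd, commit_record, len) != (ssize_t)len)
        return -1;              /* short write or error: cannot acknowledge */
    if (fsync(log_fd) != 0)     /* force the commit record to stable storage */
        return -1;
    /* Only now may the client be told "committed". */
    return 0;
}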
2.1 Anatomy of a Log Record
A typical log record contains:
+--------+--------+--------+--------+----------------+--------+
| LSN | TxID | Type | PageID | Payload | CRC |
+--------+--------+--------+--------+----------------+--------+
- LSN (Log Sequence Number): A monotonically increasing identifier for the record. Often the byte offset in the log file.
- TxID: The transaction that generated this record.
- Type: The operation type (INSERT, UPDATE, DELETE, COMMIT, ABORT, CHECKPOINT, etc.).
- PageID: The page affected by this operation.
- Payload: The actual change data. For an UPDATE, this might be the old and new values (or just the new value, depending on the logging strategy).
- CRC: Checksum for integrity verification.
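In C, the fixed part of such a record might be declared as follows; the field widths are illustrative assumptions, and real systems differ in sizes, alignment, and padding:

#include <stdint.h>

typedef struct {
    uint64_t lsn;          /* Log Sequence Number (often the byte offset) */
    uint64_t txid;         /* transaction that generated the record */
    uint16_t type;         /* INSERT, UPDATE, DELETE, COMMIT, ABORT, ... */
    uint32_t page_id;      /* page affected by the operation */
    uint32_t payload_len;  /* length of the change data that follows */
    /* payload bytes follow the header, then a uint32_t CRC over both */
} log_record_header_t;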
2.2 Logical vs. Physical Logging
There are two approaches to what goes in the payload:
Physical logging records the exact bytes changed:
Page 42, offset 128: old=0x00000064 new=0x0000003C
This is simple and fast to apply during recovery, but generates large logs for operations that touch many bytes.
Logical logging records the operation itself:
UPDATE accounts SET balance = balance - 100 WHERE id = 1
This is compact but requires re-executing the operation during recovery, which may be complex and must be deterministic.
Most systems use physiological logging—a hybrid. Log records are physical within a page (describing byte-level changes) but logical across pages (not recording the effects of page splits or other structural changes at the byte level).
3. The Log Buffer and Flushing
Writing every log record to disk the moment it is created would defeat the purpose—we’d trade random page writes for nearly as many synchronous log I/Os. Instead, databases buffer log records in memory and flush them in batches.
3.1 Group Commit
When multiple transactions commit around the same time, their commit records can be flushed together in a single I/O operation. This is group commit:
- Transaction A commits, queues its commit record, waits
- Transaction B commits, queues its commit record, waits
- Transaction C commits, queues its commit record, triggers flush
- All three commit records write in one I/O
- All three transactions are acknowledged
Group commit amortizes the cost of fsync() across multiple transactions. Under high concurrency, this can improve throughput by 10-100×.
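A minimal sketch of the mechanism in C, using a mutex and condition variable: committers wait until the durable LSN covers their commit record, while a single flusher issues one fsync() for the whole batch. The names and the division of labor are assumptions, not a specific engine's design:

#include <pthread.h>
#include <unistd.h>
#include <stdint.h>

static pthread_mutex_t log_mu     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flushed_cv = PTHREAD_COND_INITIALIZER;
static uint64_t flushed_lsn = 0;   /* highest LSN known to be durable */
static int log_fd = -1;            /* append-only log file, opened elsewhere */

/* Called by each committing transaction after queuing its commit record. */
void wait_for_durable(uint64_t my_commit_lsn) {
    pthread_mutex_lock(&log_mu);
    while (flushed_lsn < my_commit_lsn)
        pthread_cond_wait(&flushed_cv, &log_mu);
    pthread_mutex_unlock(&log_mu);
}

/* Called by the flusher (a background thread or the last committer to
   arrive): one fsync() covers every commit record queued since the last
   flush, then all waiting transactions are released at once. */
void flush_log_up_to(uint64_t buffered_lsn) {
    fsync(log_fd);                        /* single I/O for the whole batch */
    pthread_mutex_lock(&log_mu);
    if (buffered_lsn > flushed_lsn)
        flushed_lsn = buffered_lsn;
    pthread_cond_broadcast(&flushed_cv);  /* wake every batched committer */
    pthread_mutex_unlock(&log_mu);
}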
3.2 Flush Policies
When should the log buffer flush?
- On commit: Always flush at least up to the transaction’s commit record. This is required for durability.
- Buffer full: When the log buffer fills, flush to make room.
- Periodic: Flush every N milliseconds to bound how much work would be lost on crash.
- On checkpoint: Flush all dirty log records before writing a checkpoint.
The tension is between latency (flush often = lower per-transaction latency but more I/O operations) and throughput (batch flushes = better I/O efficiency but higher latency for early arrivals in the batch).
3.3 The fsync() Tax
Calling write() isn’t enough—data may linger in the OS page cache. To guarantee durability, databases must call fsync() (or fdatasync(), or use O_DIRECT with O_SYNC). This forces data to stable storage but is expensive:
- SSD: 50-200 microseconds
- HDD: 5-15 milliseconds (a full rotation)
- Battery-backed RAID controller: may acknowledge immediately if write-back cache is trusted
The fsync() cost dominates commit latency for durable transactions. Optimizations focus on reducing fsync() frequency (group commit) or hiding its latency (asynchronous commits with risk disclosure).
4. Checkpoints: Bounding Recovery Time
If the system crashes, recovery must replay the log from some starting point. But replaying the entire log since the database was created would take forever. Checkpoints bound recovery time by establishing points where all committed data is known to be on disk.
4.1 Checkpoint Types
Sharp checkpoint (quiesce):
- Stop accepting new transactions
- Wait for all active transactions to complete
- Flush all dirty pages to disk
- Write a checkpoint record to the log
- Resume normal operation
This is simple but causes a pause—unacceptable for production systems.
Fuzzy checkpoint:
- Record the current LSN as the checkpoint start
- Write out dirty pages gradually, without stopping transactions
- Track which pages were dirty at checkpoint start
- Write a checkpoint record when all those pages are flushed
Fuzzy checkpoints allow continuous operation. Recovery starts from the checkpoint LSN and replays any log records after it.
Incremental checkpoint:
Track dirty pages continuously. Periodically flush a subset of dirty pages (e.g., the oldest or coldest). The checkpoint “advances” as pages are flushed. This spreads I/O evenly over time.
4.2 Checkpoint Contents
A checkpoint record typically includes:
- The LSN at checkpoint start
- A list of active transactions (for undo during recovery)
- The dirty page table: which pages were modified and their first modifying LSN
- Optionally, the transaction table: each active transaction’s state
This metadata enables efficient recovery—the system knows exactly which log records might need to be replayed and which transactions were in progress.
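Sketched as a C structure, a serialized checkpoint record could look like the following; the field names are hypothetical, and the two tables would be variable-length arrays appended after the fixed header:

#include <stdint.h>

typedef struct { uint64_t txid; uint64_t last_lsn; uint8_t state; } txn_entry_t;
typedef struct { uint32_t page_id; uint64_t rec_lsn; } dirty_page_entry_t;

typedef struct {
    uint64_t start_lsn;        /* LSN at checkpoint start */
    uint32_t num_active_txns;  /* entries in the transaction table */
    uint32_t num_dirty_pages;  /* entries in the dirty page table */
    /* txn_entry_t[num_active_txns] and dirty_page_entry_t[num_dirty_pages]
       follow the fixed header in the serialized record. */
} checkpoint_record_t;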
5. Recovery: ARIES and Beyond
The canonical recovery algorithm is ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), developed at IBM in the early 1990s. ARIES handles all the corner cases: crashes during recovery, nested transactions, and fine-grained locking.
5.1 ARIES Recovery Phases
Recovery proceeds in three phases:
Phase 1: Analysis
Scan the log forward from the last checkpoint. Reconstruct the dirty page table and transaction table as they were at crash time. Identify:
- Which pages might need redo (were dirty and might not have been flushed)
- Which transactions were active (need undo)
Phase 2: Redo
Scan forward again, this time applying log records. For each record:
- If the page’s LSN is less than the record’s LSN, the change might be missing—reapply it
- If the page’s LSN is ≥ the record’s LSN, the change already reached disk—skip it
Redo is idempotent: applying a change multiple times has the same effect as applying it once. This is crucial—we don’t know exactly which writes made it to disk.
Phase 3: Undo
For each transaction that was active at crash time, roll back its changes. Process the log backward, undoing each operation for these transactions. Write Compensation Log Records (CLRs) for each undo operation—this ensures that if we crash during recovery, we don’t undo the same operation twice.
5.2 The LSN: Cornerstone of Recovery
The Log Sequence Number appears everywhere:
- Each log record has an LSN
- Each page stores the LSN of the last log record that modified it (the page LSN)
- The dirty page table maps pages to their recovery LSN (first dirty modification)
- CLRs reference the LSN of the record they’re compensating
During redo, comparing the page LSN to the log record’s LSN determines whether to reapply. This simple comparison makes recovery correct even when we don’t know exactly what reached disk.
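The redo decision reduces to a single comparison per record. A sketch in C, assuming hypothetical buffer-manager helpers fetch_page() and apply_change():

#include <stdint.h>

typedef struct { uint64_t lsn; uint32_t page_id; const void *payload; } log_record_t;
typedef struct { uint64_t page_lsn; /* page contents ... */ } page_t;

/* Hypothetical helpers provided by the buffer manager. */
page_t *fetch_page(uint32_t page_id);
void apply_change(page_t *page, const log_record_t *rec);

void redo_record(const log_record_t *rec) {
    page_t *page = fetch_page(rec->page_id);
    if (page->page_lsn < rec->lsn) {
        /* Page predates this record: the change may be missing, so reapply. */
        apply_change(page, rec);
        page->page_lsn = rec->lsn;   /* mark the change as applied */
    }
    /* Otherwise the change already reached disk before the crash: skip it. */
}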
5.3 Physiological Logging and Page Splits
Consider a B-tree insertion that causes a page split. Physical logging would need to record changes to:
- The original page (removed entries)
- The new page (added entries, initialized headers)
- The parent page (new pointer)
- Possibly more pages if the split propagates
This is complex and verbose. Physiological logging instead records:
- “Split page P1 creating P2 with key K as separator”
During redo, the system re-executes the split logic. The page LSN check ensures we don’t re-split an already-split page.
6. WAL in Practice: PostgreSQL
PostgreSQL’s WAL implementation illustrates real-world considerations.
6.1 WAL Segment Files
PostgreSQL organizes WAL into segment files, typically 16MB each. File names encode the timeline and segment number:
000000010000000000000001
000000010000000000000002
...
The timeline identifier supports point-in-time recovery (PITR) and replication branching.
6.2 WAL Buffers and Background Writer
The wal_buffers setting controls the shared memory WAL buffer size (by default it is sized automatically to 1/32 of shared_buffers, capped at 16MB). Log records accumulate here until:
- A transaction commits (triggers flush up to its commit record)
- The buffer fills
- The background writer decides to flush
The walwriter background process periodically flushes WAL buffers to reduce commit latency jitter.
6.3 Synchronous vs. Asynchronous Commit
PostgreSQL’s synchronous_commit setting offers trade-offs:
- on (default): Commit waits for WAL flush to disk. Full durability.
- remote_write: In replication, commit waits for WAL to reach standby’s memory. Durable if both nodes don’t fail.
- local: Commit waits for local WAL flush only.
- off: Commit returns immediately. WAL flushes in background. Up to 3× wal_writer_delay of transactions could be lost on crash.
Setting synchronous_commit = off is not “turn off durability”—it’s “accept a small window of potential loss for better latency.”
6.4 Full Page Writes
After a checkpoint, the first modification to a page writes the entire page image to WAL, not just the change. This handles torn pages: if the OS writes 4KB atomically but the database uses 8KB pages, a crash mid-write could corrupt the page. The full page image in WAL allows recovery to restore a consistent page.
This increases WAL volume significantly (full_page_writes = on is the default). On file systems or hardware that guarantee atomic writes at least as large as the database page size, it can be safely disabled.
7. WAL in Practice: SQLite
SQLite, the embedded database, takes a different approach with its WAL mode.
7.1 Rollback Journal vs. WAL
Originally, SQLite used a rollback journal (an undo-style journal):
- Before modifying a page, copy the original to a rollback journal
- Modify pages in place
- On commit, delete the rollback journal
- On crash, restore original pages from the journal
This is simple but has drawbacks: writes are random (to both database and journal), and readers block writers.
WAL mode inverts this:
- Writes go to a separate WAL file
- The main database file is not modified during transactions
- Readers access the main database file plus relevant WAL entries
- Periodically, WAL entries are “checkpointed” back to the main database
7.2 WAL Mode Benefits
- Readers don’t block writers: The main database file is stable; readers access it directly while writers append to WAL.
- Writers don’t block readers: Readers see a consistent snapshot from before the write started.
- Faster commits: Sequential WAL writes are faster than random database writes.
- Better concurrency: Multiple readers can proceed simultaneously with a writer.
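Switching a database to WAL mode is a one-line pragma. A minimal C program using the public SQLite API (the file name here is just an example):

#include <sqlite3.h>
#include <stdio.h>

int main(void) {
    sqlite3 *db;
    if (sqlite3_open("app.db", &db) != SQLITE_OK)
        return 1;
    char *err = NULL;
    /* journal_mode=WAL is persistent: later connections reuse WAL mode */
    if (sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "%s\n", err ? err : "error");
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}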
7.3 WAL-Index (wal-index)
SQLite maintains a WAL-index in shared memory mapping page numbers to WAL frame numbers. Readers consult this index to find the latest version of a page—either in WAL or the main database file.
The WAL-index uses a hash table for fast lookups and is rebuilt from the WAL file if corrupted or missing.
8. WAL in Distributed Systems
Distributed databases face additional challenges: coordinating logs across nodes, ensuring consistency, and handling network partitions.
8.1 Replicated WAL
In primary-backup replication, the primary’s WAL is shipped to standbys:
- Primary writes log records locally
- Primary sends log records to standbys (synchronously or asynchronously)
- Standbys apply log records to their copies
- On primary failure, a standby with the most complete log takes over
PostgreSQL streaming replication, MySQL binlog replication, and many others use this pattern.
8.2 Consensus-Based WAL
Raft and Paxos turn the log into a replicated state machine:
- A leader proposes appending an entry to the log
- Followers accept and persist the entry
- Once a majority acknowledges, the entry is committed
- All nodes apply committed entries in order
The log is the source of truth; the database state is a derived projection of the log. This inverts the traditional model where the database is primary and the log is a recovery mechanism.
8.3 Log-Structured Merge-Trees and WAL
LSM-tree databases (LevelDB, RocksDB, Cassandra) blur the line between WAL and data storage:
- Writes go to an in-memory memtable and a WAL (for durability)
- When the memtable fills, it’s flushed to an SSTable (sorted string table) on disk
- The WAL can be discarded once its contents are in SSTables
The WAL here is purely for crash recovery of the memtable. The SSTables themselves form a log-structured store where data is written sequentially and compacted in the background.
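The write path can be summarized in a few lines. A sketch in C with hypothetical wal_append() and memtable_put() helpers—durability first, then the in-memory structure:

#include <stddef.h>

/* Hypothetical helpers: append to the WAL, insert into the memtable. */
int wal_append(const void *key, size_t klen, const void *val, size_t vlen);
int memtable_put(const void *key, size_t klen, const void *val, size_t vlen);

int lsm_put(const void *key, size_t klen, const void *val, size_t vlen) {
    if (wal_append(key, klen, val, vlen) != 0)   /* WAL first, for recovery */
        return -1;
    return memtable_put(key, klen, val, vlen);   /* then the in-memory index */
}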
9. Performance Optimization Techniques
9.1 Reducing fsync() Frequency
The most impactful optimization is calling fsync() less often:
- Group commit: Batch multiple commits into one fsync()
- Commit delay: Wait a few milliseconds before flushing to gather more commits (trade latency for throughput)
- Asynchronous commit: Return to client before fsync() completes (trade durability for latency)
9.2 Parallel Log I/O
Some systems maintain multiple log files and stripe writes across them:
- Increases aggregate bandwidth
- Requires careful coordination to maintain ordering guarantees
- More complex recovery (merge multiple streams)
9.3 Log Compression
Compress log records before writing:
- Reduces I/O volume (log writes are often I/O-bound)
- Adds CPU overhead (compression/decompression)
- Must be careful about compression boundaries—don’t want to decompress half a record
LZ4 and Snappy offer good speed/ratio trade-offs for real-time compression.
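A sketch of compressing a batch of records with LZ4 before it is appended to the log; the 8-byte framing header (original size, compressed size) is an illustrative convention, not a standard format:

#include <lz4.h>
#include <stdint.h>
#include <string.h>

/* Returns bytes written into out (header + compressed block), or -1. */
int compress_log_batch(const char *batch, int batch_len, char *out, int out_cap) {
    if (out_cap < LZ4_compressBound(batch_len) + 8)
        return -1;                       /* not enough room for the worst case */
    int clen = LZ4_compress_default(batch, out + 8, batch_len, out_cap - 8);
    if (clen <= 0)
        return -1;
    uint32_t orig = (uint32_t)batch_len, comp = (uint32_t)clen;
    memcpy(out, &orig, 4);               /* original size, for decompression */
    memcpy(out + 4, &comp, 4);           /* compressed size = block boundary */
    return clen + 8;
}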
9.4 Non-Volatile Memory (NVM)
Intel Optane and similar persistent memory technologies change the game:
- Write latency: ~300 nanoseconds (vs. 50+ microseconds for SSD)
- No fsync() syscall—durability is achieved with cache-line flushes and memory fences instead
- Byte-addressable—no need for block-sized log records
With NVM, WAL can become a persistent in-memory buffer. Some systems eliminate WAL entirely, directly persisting data structures to NVM with careful ordering.
9.5 Direct I/O and io_uring
Traditional I/O through the page cache adds overhead:
- Data copies from user space to kernel buffers
- Page cache management
- Potentially unnecessary read-ahead
Direct I/O (O_DIRECT) bypasses the page cache, putting the database in full control. Combined with io_uring for asynchronous operations:
// Acquire an SQE and queue an asynchronous log write (ring was set up earlier)
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
io_uring_prep_write(sqe, log_fd, buffer, size, offset);
io_uring_submit(ring);
// Later, reap the completion and mark it consumed
struct io_uring_cqe *cqe;
io_uring_wait_cqe(ring, &cqe);
io_uring_cqe_seen(ring, cqe);
This can significantly reduce latency and CPU overhead for log writes.
10. WAL Size Management
The WAL grows indefinitely without intervention. Managing its size is essential.
10.1 Log Truncation
Old log records are no longer needed after:
- All transactions that generated them have committed or aborted
- All dirty pages they modified have been checkpointed to disk
- Any replicas have received and applied them
The database tracks the “oldest needed LSN” and truncates the log up to that point.
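Computed as code, the truncation point is just a minimum over every consumer that still needs the log. A sketch with hypothetical inputs:

#include <stdint.h>

/* Everything strictly below the returned LSN may be truncated. */
uint64_t truncation_lsn(uint64_t oldest_active_txn_lsn,
                        uint64_t oldest_dirty_page_rec_lsn,
                        uint64_t slowest_replica_lsn) {
    uint64_t lsn = oldest_active_txn_lsn;
    if (oldest_dirty_page_rec_lsn < lsn) lsn = oldest_dirty_page_rec_lsn;
    if (slowest_replica_lsn < lsn) lsn = slowest_replica_lsn;
    return lsn;
}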
10.2 Archive and Point-in-Time Recovery
For disaster recovery, WAL segments can be archived before truncation:
- Copy filled WAL segments to archive storage (S3, HDFS, tape)
- To recover: restore a base backup, then replay archived WAL segments
This enables point-in-time recovery (PITR): restore to any moment by replaying WAL up to that LSN.
PostgreSQL’s archive_command and restore_command configure this:
archive_command = 'cp %p /archive/%f'
restore_command = 'cp /archive/%f %p'
10.3 Log Recycling
Instead of creating and deleting segment files, recycle them:
- Fill segment 001, mark ready for archive
- Start writing to segment 002
- When segment 001 is archived and no longer needed, rename it to 003
- Reuse 003 after filling 002
This reduces file system overhead and metadata updates.
11. Correctness Considerations
11.1 Torn Writes and Checksums
A power failure mid-write can leave partial data on disk. Defenses:
- Checksums: Every log record includes a CRC. On recovery, validate checksums; discard corrupted records at the end.
- Double writes: Write to two locations; use the intact copy. (More common for data pages than log.)
- Atomic write units: Use hardware or file system features that guarantee atomic writes.
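For the checksum defense, the recovery scan validates each record before trusting it; the first failure marks the torn tail of the log. A sketch using zlib's crc32 (the choice of checksum is an assumption—many engines use CRC-32C instead):

#include <zlib.h>
#include <stdint.h>

/* rec points at the serialized header + payload; stored_crc was read from
   the record's trailer. A mismatch means a torn or partial record. */
int record_is_valid(const unsigned char *rec, uint32_t len, uint32_t stored_crc) {
    uint32_t crc = (uint32_t)crc32(0L, rec, len);
    return crc == stored_crc;
}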
11.2 Write Ordering
Databases assume certain write ordering guarantees:
- Log writes before data writes
- Data writes before checkpoint records
- fsync() before returning to client
File systems and storage controllers may reorder writes. Use:
- fsync() / fdatasync() to force ordering
- O_SYNC or O_DSYNC for immediate durability
- Write barriers (deprecated on Linux, use explicit flushes)
Beware of “lying” controllers that acknowledge writes before they reach stable storage. Battery-backed write caches are acceptable; volatile caches are not.
11.3 Testing with Fault Injection
Databases should test crash recovery exhaustively:
- Kill -9 tests: Stop the process abruptly and verify recovery
- Power failure simulation: Use device-mapper targets like dm-flakey to drop or delay writes, or libeatmydata to simulate storage that ignores fsync()
- File system corruption: Inject bad blocks or truncate files
- Slow I/O: Delay writes to expose timing-related bugs
Formal verification of WAL implementations (e.g., using TLA+) catches subtle bugs in the protocol logic.
12. WAL Alternatives and Complements
12.1 Shadow Paging
Instead of logging changes, maintain two copies of each page:
- Current: the live version
- Shadow: the previous committed version
On commit, atomically switch from shadow to current (update a single pointer). On abort or crash, the shadow version is intact.
Shadow paging has fallen out of favor:
- Fragmentation: pages scatter across disk
- Pointer updates cascade (updating a leaf requires updating parent, etc.)
- Concurrency is harder than with WAL
LMDB uses a variant with copy-on-write: modified pages are written to new locations, and a root pointer update commits the transaction.
12.2 Command Logging
Log the commands (SQL statements, operations) instead of their effects:
- Very compact logs
- Requires deterministic execution for replay
- Recovery replays commands, which may be slow
VoltDB uses command logging, relying on deterministic execution within a partition.
12.3 Write-Behind Logging (WBL)
With NVM’s fast writes, WAL’s overhead becomes significant. Write-behind logging inverts the order:
- Write dirty data directly to NVM (fast)
- Log which data was written (for recovery metadata)
- On crash, analyze what might be inconsistent and recover
This is speculative—WBL requires new data structures and recovery logic tailored to NVM’s characteristics.
13. Case Study: WAL in etcd
etcd, the distributed key-value store backing Kubernetes, uses WAL for durability within each Raft node.
13.1 WAL Structure
Each etcd node maintains a WAL directory:
wal/
0000000000000000-0000000000000000.wal
0000000000000000-0000000000001000.wal
...
Segment files contain:
- Raft log entries (proposals, configuration changes)
- Raft state (term, vote, commit index)
- CRC records for integrity
13.2 Snapshot and WAL Compaction
As the Raft log grows, etcd takes snapshots:
- Serialize the current key-value state to a snapshot file
- Record the snapshot’s index in the WAL
- Truncate WAL entries before the snapshot index
Recovery loads the latest snapshot, then replays WAL entries after it.
13.3 fsync() on Every Entry
etcd fsync()s after every Raft entry by default—correctness requires durability before acknowledging. This limits single-node throughput but guarantees linearizability.
The --wal-fsync-interval flag allows batching for higher throughput with slightly weaker guarantees (suitable for some workloads).
14. Case Study: WAL in Apache Kafka
Kafka is fundamentally a distributed commit log. Each partition is an append-only log stored as segment files.
14.1 Log Segments
Kafka partitions consist of segment files:
00000000000000000000.log
00000000000000000000.index
00000000000000012345.log
00000000000000012345.index
...
Each .log file contains messages; the .index file maps offsets to file positions for fast seeks.
14.2 Producer Acknowledgments
Kafka producers can choose durability level:
- acks=0: Fire and forget. No durability guarantee.
- acks=1: Wait for leader to write to its log. Durable if leader doesn’t fail.
- acks=all: Wait for all in-sync replicas to acknowledge. Full durability.
The log.flush.interval.messages and log.flush.interval.ms settings control when Kafka calls fsync(). By default, Kafka relies on replication for durability, not fsync(), accepting small data loss windows for throughput.
14.3 Log Compaction
Kafka supports two retention policies:
- Time/size-based: Delete segments older than N days or when log exceeds N bytes.
- Compaction: Retain only the latest value for each key. Useful for changelog topics.
Compaction is a form of garbage collection for the log, preserving state while reducing storage.
15. WAL and Storage Class Memory
Emerging storage class memory (SCM) like Intel Optane DC Persistent Memory blurs the line between memory and storage.
15.1 DAX and PMEM
Direct Access (DAX) mode allows applications to mmap() persistent memory and access it directly:
// MAP_SYNC requires MAP_SHARED_VALIDATE rather than plain MAP_SHARED
void *pmem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                  MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
// Stores to pmem become persistent once the cache line is flushed
_mm_clflush((char *)pmem + offset);
_mm_sfence();
No syscalls, no page cache—just load and store instructions.
15.2 Redesigning WAL for PMEM
With PMEM, traditional WAL has overhead:
- Copying data to log buffers (unnecessary with direct access)
- fsync() calls (replaced by cache flush instructions)
- Sequential-only access (PMEM supports random access efficiently)
New designs:
- Log directly to PMEM: Allocate log buffer in persistent memory. Writes are immediately durable.
- In-place updates with undo logging: Log the old value, update in place, flush. Simpler than redo logging.
- Hybrid: Keep hot path in PMEM, spill to SSD for capacity.
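For the first option—logging directly to persistent memory—an append might look like the sketch below, assuming the log region was mapped with MAP_SYNC as in section 15.1, is 64-byte aligned, and that the CRC is written last so a torn append is detectable:

#include <immintrin.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

void pmem_append(char *log_tail, const void *rec, size_t len, uint32_t crc) {
    memcpy(log_tail, rec, len);               /* copy the record into PMEM */
    for (size_t off = 0; off < len; off += 64)
        _mm_clflush(log_tail + off);          /* flush each dirty cache line */
    _mm_sfence();                             /* order payload before the CRC */
    memcpy(log_tail + len, &crc, sizeof crc); /* publish: record valid only once the CRC lands */
    _mm_clflush(log_tail + len);
    _mm_sfence();
}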
15.3 Challenges
PMEM introduces new challenges:
- Cache line granularity: CPU cache flushes are 64 bytes. Writes smaller than a cache line may require read-modify-write.
- Ordering: Memory fences and flush instructions have specific semantics. Getting ordering wrong corrupts data.
- Wear leveling: PMEM has limited write endurance. Avoid hot spots in the log.
Libraries like PMDK (Persistent Memory Development Kit) provide safe abstractions for PMEM programming.
16. Debugging and Observability
16.1 Log Analysis Tools
Most databases provide tools to inspect WAL contents:
- PostgreSQL: pg_waldump decodes WAL records
- MySQL: mysqlbinlog decodes the binary log
- SQLite: sqlite3 .dump with WAL mode shows effective state
These tools help diagnose:
- What operations are generating the most WAL
- Whether WAL growth is expected
- What happened before a crash
16.2 Metrics to Monitor
Key WAL metrics:
- WAL write rate: Bytes per second written to WAL. High rates may indicate heavy write workloads or inefficient operations.
- WAL fsync latency: Time per fsync() call. Spikes indicate storage issues.
- WAL buffer wait time: How long transactions wait for buffer space. High waits suggest increasing wal_buffers.
- Checkpoint frequency and duration: Frequent or long checkpoints indicate I/O or configuration issues.
- WAL size / segment count: Growth indicates archiving or replication lag.
16.3 Tracing
Tracing frameworks (DTrace, BPF, perf) can instrument WAL operations:
# Trace PostgreSQL WAL writes; pass the WAL file descriptor as the first argument
bpftrace -e 'tracepoint:syscalls:sys_enter_write
  /comm == "postgres" && args->fd == $1/
  { @bytes = hist(args->count); }' <wal-fd>
This reveals I/O patterns: are writes batched efficiently? Is there unexpected synchronization?
17. Common Pitfalls
17.1 Ignoring Disk Flush Semantics
Assuming write() implies durability is a classic mistake. Always:
- Call fsync() or fdatasync() for durability
- Test with actual power loss (not just kill -9)
- Verify storage controller write cache settings
17.2 Log Contention
A single log can become a bottleneck:
- All transactions serialize on log buffer allocation
- A single fsync() blocks all waiters
Mitigations:
- Group commit reduces fsync() frequency
- Multiple log files (if supported)
- Asynchronous commit for suitable workloads
17.3 Unbounded Log Growth
Forgetting to checkpoint or archive leads to disk exhaustion. Configure:
- Regular checkpoints (checkpoint_timeout, checkpoint_completion_target in PostgreSQL)
- WAL archiving or streaming replication
- Monitoring and alerts on WAL size
17.4 Recovery Testing Neglect
Many systems never test recovery until production disaster strikes. Regularly:
- Practice recovery procedures
- Verify backups are restorable
- Test point-in-time recovery to specific LSNs
18. Future Directions
18.1 Hardware Acceleration
Emerging hardware offers log acceleration:
- Computational storage: Push log compression/checksum to the drive
- SmartNICs: Offload log replication to network hardware
- Custom ASICs: Specialized log processing units
18.2 Tiered Storage
Logs don’t need to live on one tier:
- Hot (newest) log records on fast storage (NVM, SSD)
- Warm (recent) records on standard SSD
- Cold (archive) records on HDD or object storage
Automated tiering based on age or access patterns.
18.3 Log as the Database
The logical conclusion of log-centric design: the log is the database, and everything else is a materialized view. Kafka’s evolution toward this model, systems like Materialize, and event sourcing architectures embrace this philosophy.
Benefits:
- Simpler consistency model
- Natural audit trail
- Easy to derive multiple views
Challenges:
- Log growth management
- Query efficiency on append-only structures
- Integration with existing ecosystems
19. Implementation Checklist
When implementing or configuring WAL:
Understand durability requirements: What data loss is acceptable? Zero? A few seconds? This determines flush policy.
Size buffers appropriately: Too small = excessive flushing. Too large = wasted memory and longer recovery.
Configure checkpoints: Balance recovery time (frequent checkpoints = fast recovery) against I/O overhead.
Test crash recovery: Regularly crash the system and verify recovery correctness.
Monitor WAL metrics: Track write rates, fsync latency, and log size. Alert on anomalies.
Plan for growth: Archive or replicate WAL before truncation. Test restore procedures.
Understand your storage: Know your disk’s fsync behavior, write cache policy, and failure modes.
Consider replication: WAL shipping provides durability beyond a single node.
20. The Mathematics of WAL Performance
20.1 Throughput Modeling
Let’s model WAL throughput under group commit. Assume:
- t_write: Time to write a batch of log records
- t_fsync: Time to fsync (typically dominates)
- n: Average transactions per batch
Throughput = n / (t_write + t_fsync)
With t_fsync ≈ 100 microseconds (SSD), t_write negligible by comparison, and efficient batching (n = 100): Throughput ≈ 100 / 0.0001 = 1,000,000 transactions per second
This theoretical limit explains why group commit is so effective—amortizing the fixed fsync cost over many transactions.
20.2 Recovery Time Estimation
Recovery time depends on:
- L: Log length since last checkpoint (in bytes)
- R: Redo processing rate (bytes per second)
- U: Number of uncommitted transactions (for undo)
- T_undo: Average undo time per transaction
Total recovery time ≈ L/R + U × T_undo
For a system with 1GB of log since checkpoint, 100MB/s redo rate, and 1000 uncommitted transactions at 1ms each: Recovery ≈ 10 seconds + 1 second = 11 seconds
This guides checkpoint frequency decisions: more frequent checkpoints mean smaller L and faster recovery.
20.3 WAL Space Amplification
WAL introduces space overhead:
- Each write appears twice: once in the log, once in the data file
- Full page writes (after checkpoints) add more overhead
- Log retention for replication or PITR extends this
Space amplification = (Log size + Data size) / Effective data size
With aggressive log retention, amplification can reach 3-5×. This trade-off between durability and space is fundamental.
21. Summary
Write-ahead logging is the foundation of database durability. By writing a sequential log before modifying data pages, databases achieve both crash safety and high performance. The key principles:
- Log before data: Never modify a page without first logging the change
- Sequential writes win: Logs are append-only, turning random writes into sequential ones
- Group commit amortizes fsync(): Batch commits together to reduce I/O overhead
- Checkpoints bound recovery: Periodic checkpoints limit how much log must be replayed
- ARIES provides a complete framework: Analysis, redo, undo phases handle all recovery scenarios
Modern systems push these foundations further with replicated logs, consensus protocols, and emerging persistent memory. But the core insight remains: by constraining how we write, we gain the freedom to recover from anything.
Whether you’re building a new database, configuring an existing one, or designing a distributed system, understanding WAL is essential. It’s the unsung hero working quietly behind every committed transaction, ensuring your data survives whatever the world throws at it.