Designing A Scalable Key Value Store With Lsm Trees And Write Amplification Analysis

2019-08-02 · Leonardo Benicio
Summary

A comprehensive technical exploration of designing a scalable key value store with lsm trees and write amplification analysis, covering key concepts, practical implementations, and real-world applications.

Contents

The Hidden War Inside Your Database: Designing a Scalable Key-Value Store with LSM Trees and the Brutal Economics of Write Amplification

Introduction

Imagine for a moment that you are a librarian. But not just any librarian—you are the sole librarian for a city of ten million people, each of whom demands instantaneous access to books that are constantly being updated, deleted, and rewritten. Every second, thousands of patrons storm your doors, each clutching a note that says either “change page 47 of War and Peace to say ‘potato’” or “destroy every copy of Moby Dick and replace it with a version where the whale wins.”

This is the reality of modern distributed storage systems. Every time you post a status update, stream a video, or execute a financial trade, somewhere in a data center, a database is performing a write operation that must be persisted reliably, retrieved quickly, and maintained consistently—all while handling millions of concurrent requests. And the architecture that makes this possible? It is far stranger, more counterintuitive, and more fascinating than most engineers realize.

We are living through an unprecedented explosion of write-heavy workloads. Social media feeds, IoT sensor streams, financial transaction logs, and real-time analytics pipelines are all generating data at rates that would have been unthinkable a decade ago. The humble key-value store—the simplest possible data structure, just a map from keys to values—has become the workhorse of modern infrastructure. Systems like RocksDB, LevelDB, Apache Cassandra, and HBase collectively handle exabytes of data, powering everything from Facebook’s messaging infrastructure to Netflix’s recommendation engines.

But here’s the dirty secret: building a key-value store that can handle petabyte-scale datasets with consistent write performance is hard. Harder than most engineers realize. The naive approach—just use a hashtable on disk—fails catastrophically. Hashtables suffer from random I/O, poor locality, and near-impossible range queries. B-trees, the darlings of relational databases, perform admirably for reads but become nightmare factories of write amplification when faced with random insertions and updates.

The solution that emerged from the academic and engineering trenches is the Log-Structured Merge-Tree (LSM tree)—a deceptively simple idea that has become the backbone of modern storage infrastructure. But LSM trees come with their own hidden war: the brutal economics of write amplification. Every byte you write to your database may cause 10, 20, or even 50 bytes of internal disk I/O, wearing out your SSDs, consuming network bandwidth, and burning through your cloud budget.

In this deep dive, you will learn what LSM trees are, why they dominate the big data landscape, and how to tame the beast of write amplification. We will walk through real-world implementations like RocksDB and Cassandra, examine the trade-offs between different compaction strategies, and explore advanced tuning techniques that can double or triple the throughput of your storage system. Along the way, I will arm you with concrete numbers, code snippets, and mental models that will transform how you think about the hidden war inside your database.


The Problem with Traditional Indexes

Before we understand why LSM trees exist, we must first understand why the obvious solutions fail so spectacularly under write-heavy workloads.

A Quick Refresher on Disk Access

To appreciate the challenges, we need to internalize one critical fact: disk I/O is measured in milliseconds for random access, while sequential access can be an order of magnitude faster. A modern NVMe SSD can handle around 1,000,000 random reads per second (IOPS) but can deliver 3 GB/s of sequential bandwidth—that’s roughly 300 times more data per second when accessed sequentially. The difference is even more dramatic for traditional spinning hard drives (HDDs), where random access takes ~10ms while sequential bandwidth is ~200 MB/s—a difference of 2,000x.

In other words, sequential is king. Any storage engine that requires random writes to disk is doomed to poor performance at scale. Yet that is exactly what B-trees and hash tables demand.

B-trees: The Relational Database Champion

B-trees (and their close cousins B+ trees) are the most common indexing structure in relational databases like MySQL, PostgreSQL, and Oracle. They look like a balanced tree where each node holds a range of keys and pointers to child nodes. Nodes are stored in fixed-size pages (typically 4 KB or 8 KB). To insert a key, the database traverses the tree to the appropriate leaf page, finds the correct position, and slots the key in. If the page is full, it splits into two pages, updating parent nodes.

This sounds fine—but the devil is in the write pattern. Writing a key requires a random read (to find the leaf page) followed by a random write (to update that page). Worse, page splits cause cascade updates that touch multiple pages. Consider the following:

  • Random inserts (e.g., writing timestamped data that hashes to arbitrary buckets) will hit random leaf pages. Every insert becomes a random I/O operation.
  • Write amplification in a B-tree: Each logical write of one key results in ~4-6 physical writes (reading the page, modifying it, writing it back, plus any split overhead). On a mechanical HDD, that means 10-20 ms per write—capping your insert rate at 50-100 writes per second.
  • Space amplification: B-trees maintain at least 50% occupancy per page on average. For large datasets, this means a lot of unused space.

B-trees excel at point reads (they require only O(log N) random I/Os) and range queries (since leaf pages are often linked). But for write-heavy workloads, they are a disaster. The worst-case scenario is when keys are inserted in random order—exactly what happens in many modern applications (e.g., user IDs, session tokens, UUIDs).

Hash Tables: Fast Point Reads, Awful Everything Else

Hash tables on disk (extendible hashing, linear hashing) solve the point read problem by using a hash function to locate data with O(1) I/O on average. However, they have three fatal flaws:

  1. No range queries: You cannot iterate over keys in sorted order without scanning the entire hash table. This makes range scans, prefix scans, and ordered iteration impossible.
  2. Random writes: Inserting a new key typically requires computing the bucket and writing to a random location. Same random I/O problem as B-trees.
  3. Resizing and rehashing: When the hash table grows, all data must be rehashed, causing a massive burst of random writes.

Hash tables are used in some specialized key-value stores (Redis is an in-memory hash table, but it persists via snapshotting/AOF, which is sequential), but for on-disk storage they are rarely the primary structure.

The Fundamental Conflict

The core tension is between sorted order (which enables efficient range queries and merges) and write-friendly access patterns (which require sequential I/O). B‑trees maintain sorted order on disk but pay the price of random writes. Hash tables give fast point reads but sacrifice ordering and cause random writes anyway.

What we really need is a data structure that:

  • Keeps data sorted for efficient range scans and merges.
  • Writes data sequentially to maximize disk bandwidth.
  • Allows efficient point reads despite the lack of in-place updates.

Enter the LSM tree.


Enter the LSM Tree: A Revolutionary Data Structure

The Log-Structured Merge-Tree was first described by O’Neil, Cheng, Gawlick, and O’Neil in a 1996 paper titled The Log-Structured Merge-Tree (LSM-tree). Their insight was elegant: instead of trying to update data in place, append everything to a sequential log, and then asynchronously merge older data into sorted structures.

Core Idea: Buffered, Append-Only, Merged

The LSM tree turns the problem upside down. Here’s the high-level architecture:

  • Writes are batched in memory in a sorted data structure called the memtable.
  • When the memtable reaches a certain size, it is flushed to disk as an immutable SSTable (Sorted String Table) – a sequential file containing sorted key-value pairs.
  • Over time, many SSTables accumulate at different levels. To maintain read performance and reclaim space, a background process called compaction merges SSTables together, discarding overwritten and deleted keys.
  • Reads must search through the memtable and then through the SSTables, from newest to oldest, until they find the desired key.

This design ensures that all writes to disk are sequential—both the initial flush and the later compaction merges. Random writes become a myth.

The Components in Detail

Let’s unpack each component and how they work together.

Memtable

The memtable is an in-memory sorted data structure, typically a skiplist or a balanced binary tree. Every write (put, delete, merge) goes first into the memtable. The memtable acts as a write buffer. When it reaches a configurable threshold (e.g., 64 MB), it becomes immutable and a new memtable is created to accept future writes. The old memtable is then flushed to disk.

Key properties:

  • Very fast writes: Memory writes are sub‑microsecond.
  • Sorted order: Keep keys in order for efficient flushes and future merges.
  • Concurrent access: RocksDB uses a single writer thread and lock‑free reads (with a spinlock for the active memtable).
  • Write Ahead Log (WAL): To ensure durability, each write is first appended to a sequential WAL on disk before being added to the memtable. That way, if the process crashes, the memtable can be reconstructed from the WAL on restart.

SSTable (Sorted String Table)

An SSTable is an immutable, sorted, on-disk file containing key-value pairs. The format is typically:

  • A data block containing contiguous key-value pairs.
  • An index block mapping the first key of each data block to its offset.
  • A footer with metadata.

Because SSTables are immutable, they can be read without locking. Multiple SSTables may contain the same key—the most recent version wins.

Levels (L0, L1, L2, …)

LSM trees organize SSTables into a hierarchy of levels. Level 0 (L0) is special: it holds SSTables flushed directly from the memtable. SSTables in L0 may have overlapping key ranges (because they are flushed in time order). Levels 1 and below maintain the invariant that SSTables within a level have non-overlapping key ranges and are sorted by key. Each level is typically 10× larger than the previous one (a common multiplier called the level size ratio).

For example, in LevelDB’s configuration:

  • L0: 4 files (each ~2 MB default), overlapping allowed.
  • L1: total size is 10 MB (4 files * 2 MB ~ 10 MB), non-overlapping.
  • L2: total size is 100 MB (10 × L1).
  • L3: 1 GB, etc.

This exponential growth ensures that the total data volume is distributed across levels, with most writes going to the smallest levels.

The Write Path Step by Step

Let’s trace what happens when you write a key-value pair PUT "user:1000" -> "Alice".

  1. WAL Append: The write is appended to the Write Ahead Log on disk. This is a sequential write to a single file, so it’s fast.
  2. Memtable Insert: The key-value pair is inserted into the active memtable’s sorted data structure.
  3. Memtable Full? If the memtable size exceeds its threshold, the system flags it as immutable and creates a new active memtable.
  4. Flush: A background thread picks up the immutable memtable and flushes it to disk as a new SSTable in L0. This flush is a sequential write of a sorted file. During flush, the SSTable is built with internal indexes.
  5. WAL Truncation: Once the flush is confirmed written to disk, the corresponding WAL segment is truncated.

That’s it. No random writes, no page splitting, no tree rebalancing. The write latency is dominated by the WAL append and the memory insertion, both extremely fast. Background flushes and compactions happen asynchronously.

The Read Path: A Multi-Level Treasure Hunt

Reading a key is more complicated. To find a key, the system must check:

  1. The active memtable.
  2. The immutable memtable(s) (if any being flushed).
  3. All SSTables in L0 (from newest to oldest, since overlaps exist).
  4. SSTables in L1, L2, etc. (only one file per level because they are non-overlapping; we can binary search the level’s index to find the candidate file).

If the key is not found, it’s absent. This sounds expensive—searching through potentially many SSTables—but optimizations reduce the cost.

Bloom Filters: Each SSTable can carry a Bloom filter that quickly tells us if the key might exist in that file. Bloom filters have zero false negatives (if a key is present, the filter always says “maybe present”) and a small configurable false positive rate (e.g., 1%). This avoids unnecessary disk reads for keys that are not in a given SSTable.

Block Cache: Frequently accessed data blocks are cached in memory (the block cache). When a read needs a data block from an SSTable, the cache is checked first.

Index Blocks: The index block of an SSTable is usually kept in memory. This allows us to locate the exact data block that might contain the key without an extra disk read.

Even with these optimizations, a read may have to check several SSTables in L0 (worst case: all of them) and one SSTable per level below. With Bloom filters, most checks are resolved in memory, and only 1-2 disk reads are typical.

Why LSM Trees Dominate Write-Heavy Workloads

The magic of the LSM tree is that it converts random writes into sequential writes. All file writes (flushes and compactions) append data sequentially. This means:

  • High write throughput: 100,000 to 1,000,000 writes per second on a single node are common.
  • Efficient use of disk bandwidth: Sequential writes saturate the media, whereas random writes leave bandwidth on the table.
  • Excellent compression: SSTables can be compressed block‑by‑block (e.g., Snappy, Zstd) because data is sorted, giving high compression ratios.

However, LSM trees pay a price: compaction. This is where the “hidden war” begins, and it’s all about write amplification.


The Brutal Economics of Write Amplification

Write amplification (WA) is the single most important metric for understanding the cost and performance of an LSM tree. It is defined as the ratio between the amount of data physically written to storage and the amount of data logically ingested by the application.

Write Amplification = (Bytes written to storage) / (Bytes inserted by application)

For example, if your application inserts 1 GB of data but the storage engine writes 10 GB to disk (including flushes, compactions, and internal copies), your write amplification is 10×.

Why Does Write Amplification Occur?

Every byte you insert or update into an LSM tree will be written multiple times throughout its lifetime:

  1. First write: Appending to the WAL (sequential, but still a write).
  2. Flush: Writing the memtable to an L0 SSTable.
  3. Compactions: As SSTables move down the levels, they are repeatedly merged and rewritten. A key may be compacted from L0 → L1, then L1 → L2, then L2 → L3, etc.

The number of times a single key is rewritten depends on the number of levels it passes through and the compaction strategy. In a typical leveled compaction scheme with a level multiplier of 10, a key might be rewritten 10-30 times over its lifetime.

Example: Leveled Compaction

Consider LevelDB’s default configuration:

  • L0 → 4 files, each 2 MB.
  • L1: 10 MB total (non-overlapping).
  • L2: 100 MB.
  • L3: 1 GB.
  • L4: 10 GB.

When a flush happens, a 2 MB file is written to L0. The first time it is compacted from L0 to L1, it is read (2 MB read) and merged with the overlapping files in L1 (say 10 MB read), producing new L1 files (12 MB written). So that original 2 MB has caused 2 MB read + 10 MB read + 12 MB write = 24 MB of I/O → WA so far = 12× just for one compaction step.

Then that 12 MB of new L1 data will eventually be compacted down to L2, causing even more I/O.

The total write amplification for a leveled LSM tree is approximately:

WA ≈ (L * (size_ratio - 1)) + 1

where L is the number of levels and size_ratio is the fanout between levels (typically 10). For 4 levels and ratio 10, WA ≈ (4 * 9) + 1 = 37×. That’s brutal: each logical write results in 37 physical writes!

Real-world numbers for RocksDB with leveled compaction are typically 10-30× depending on configuration.

The Cost of Amplification

Write amplification has a direct impact on:

  • SSD endurance: NAND flash has a limited number of program/erase cycles (e.g., 10,000 cycles for TLC, 100,000 for SLC). A WA of 20× means your SSD wears out 20 times faster than the logical write rate suggests. In a datacenter running hundreds of database nodes, this translates to frequent drive replacements.
  • I/O bandwidth: If your disk can write 500 MB/s sequentially, a WA of 20× means your effective logical write throughput is at most 25 MB/s. That might be far below your application’s needs.
  • CPU and memory: Compaction requires reading, sorting, merging, compressing, and CRC checking. This consumes CPU and temporarily uses large memory buffers.
  • Latency spikes: Compaction runs in the background, but if it competes for disk resources, foreground read/write latencies can suffer. This is the write stall problem—when the number of L0 files grows too large (because flushes are faster than compactions), the system throttles writes.

Measuring Write Amplification in Practice

You can observe WA in production databases. For example, in RocksDB, you can query the following statistics:

  • rocksdb.bytes.written – total bytes written to the database (including internal).
  • rocksdb.bytes.written by self – bytes written by compactions.
  • rocksdb.bytes.written by flushes – bytes written by flushes.
  • rocksdb.bytes.written by WAL – WAL writes.

By summing these and dividing by the number of logical keys written, you get the approximate WA.

Cassandra exposes the metric TotalWriteLatencyHistogram and WriteAmplification via nodetool compactions.

The Dilemma: High WA vs. Low WA

Why not just set the write amplification to 1? That would be ideal: every write to the application equals one write to disk. But achieving WA = 1 would require never doing compaction—just flushing chunks and never merging. That leads to an explosion of SSTables over time, making reads impossibly slow (you’d have to look through thousands of files). There is an intrinsic trade-off:

  • Low write amplification → many small SSTables, high read amplification, poor read performance.
  • High write amplification → fewer, well‑merged SSTables, fast reads, but slower writes and more wear.

The “right” WA is a function of your workload. Read-heavy systems can tolerate higher WA because reads benefit from fewer files. Write-heavy systems must minimize WA to maximize throughput and endurance.


Compaction: The Heart of the Beast

Compaction is the process that merges SSTables to keep the number of files under control, remove deleted keys, and maintain the level structure. It is the most complex and performance-critical part of an LSM engine. There are several compaction strategies, each with different WA and space usage trade-offs.

Size-Tiered Compaction (STC)

Used by Apache Cassandra (as the default for many years) and earlier versions of HBase.

How it works: Instead of maintaining strict levels, the system groups SSTables into “tiers” based on size. For example:

  • Tier 1: files of size ~2 MB
  • Tier 2: files of size ~20 MB
  • Tier 3: files of size ~200 MB
  • etc.

When a tier reaches a certain number of files (e.g., 4), all of them are compacted together into a single larger file in the next tier. The files within a tier may have overlapping key ranges.

Pros:

  • Very simple to implement.
  • Low read amplification for point reads (because only the most recent file in each tier may contain the key – but because of overlaps, you might have to check multiple files).
  • Can be efficient for workloads with many overwrites, because the compaction merges many versions at once.

Cons:

  • High space amplification: old files are not deleted until compaction completes, so you may use up to 2× the logical data size temporarily.
  • Large compaction bursts: when a tier reaches its threshold, it compacts all files simultaneously, causing a huge spike in I/O and CPU.
  • Write amplification can be high because each file is compacted several times as it moves through tiers. For example, a 2 MB file may be compacted with three others into a 20 MB file, then later that 20 MB file is compacted with three others into 200 MB, etc. WA can be 20-40× or higher.

Leveled Compaction

Used by LevelDB, RocksDB (default), and HBase (via LSM). This is the classic LSM approach with non-overlapping levels.

How it works:

  • Level 0: files flushed from memtable, overlapping allowed.
  • Level 1: non-overlapping files covering the entire key space. Total size = max_bytes_for_level_base.
  • Level 2: non-overlapping, total size = max_bytes_for_level_base * level_size_ratio (typically 10).
  • And so on.

Compaction is triggered when a level exceeds its size limit. A compaction picks a file from level L (or multiple files) and merges them with all overlapping files from level L+1. The resulting merged data is written as new files to level L+1.

Key invariant: Each level (except L0) has files that are strictly non-overlapping and sorted. This reduces read amplification: to find a key you check L0 (all files), then binary search L1’s file index for the one file that could contain the key, then one file in L2, etc.

Pros:

  • Low read amplification: only one file to check per level (plus all L0).
  • Predictable space usage (can be tuned to ~1.1× logical size).
  • Gradual compaction – small merges spread over time, avoiding huge I/O spikes.

Cons:

  • Higher write amplification than size-tiered for the same data volume (because data is rewritten many times as it descends levels).
  • Write amplification can be tuned by adjusting the level size ratio (e.g., 5 instead of 10 reduces WA but increases space usage and read amplification).

Tiered + Leveled Hybrid

Cassandra moved from pure size-tiered to a hybrid strategy called Tiered Compaction Strategy (TCS) and later TimeWindowCompactionStrategy (TWCS) for time-series data. RocksDB introduced Universal Compaction (a size-tiered variant) and FIFO Compaction (for time-bounded data).

Universal Compaction (RocksDB)

Universal compaction behaves like size-tiered but with a twist: it selects a contiguous range of files from sorted order (by creation time) and merges them into a single file. This results in a single sorted run of SSTables. Write amplification is typically lower than leveled (around 8-15×), but space amplification can be higher (up to 2×).

FIFO Compaction

Trivially simple: when total size exceeds a limit, delete the oldest SSTable. This is perfect for caching layers or time‑series data where you only care about recent data. WA = 1 (no rewriting at all), but reads of older data fail immediately. Extremely fast writes.

Compaction in Practice: RocksDB Examples

RocksDB’s configuration is highly parameterized. Here’s an example snippet for a write-heavy workload:

// Write heavy workload tuning
options.level_compaction_dynamic_level_bytes = true;
options.max_bytes_for_level_base = 256 * (1ULL << 20); // 256 MB
options.max_bytes_for_level_multiplier = 8;
options.level0_file_num_compaction_trigger = 4;
options.level0_slowdown_writes_trigger = 8;
options.level0_stop_writes_trigger = 12;
options.target_file_size_base = 64 * (1ULL << 20); // 64 MB
options.target_file_size_multiplier = 2;
options.compaction_style = kCompactionStyleLevel;
  • max_bytes_for_level_base controls L1 size.
  • max_bytes_for_level_multiplier = 8 means L2 = 2 GB, L3 = 16 GB, etc.
  • level0_file_num_compaction_trigger = 4 means after 4 L0 files, a compaction is triggered.
  • level0_slowdown_writes_trigger = 8 means if there are 8+ L0 files, writes are slowed down (throttled) to give compaction time to catch up.
  • level0_stop_writes_trigger = 12 means if L0 files reach 12, writes are completely stopped until compaction reduces the count. This protects against write stalls.

For a write-heavy workload with low read latency requirements, you might increase level0_file_num_compaction_trigger to reduce compaction frequency (but more L0 files hurt reads). Or you might use universal compaction:

options.compaction_style = kCompactionStyleUniversal;
options.compaction_options_universal.size_ratio = 10; // 10% size ratio
options.compaction_options_universal.min_merge_width = 4;
options.compaction_options_universal.max_merge_width = 8;

Compaction and Space Amplification

Space amplification is the ratio between the total on-disk data size and the logical data size. It is caused by:

  • Multiple versions of the same key (old values not yet compacted away).
  • Deleted keys (tombstones) that haven’t been purged.
  • Space occupied by compaction input files before they are deleted.

In leveled compaction, space amplification can be as low as 1.1× (10% overhead) if you tune well. In size-tiered, it can be 2× or more.

The Compaction I/O Pattern: Sequential, but Expensive

Even though compactions are sequential I/O, they still consume bandwidth. A compaction reads one or more SSTables, merges them in memory, and writes new ones. The read and write traffic is roughly proportional to the amount of data being compacted.

Modern SSDs have limited endurance (measured in Total Bytes Written, TBW). If you run a database with WA = 20× and ingest 1 TB/day, your SSD sees 20 TB/day of writes. A typical datacenter SSD with 1 PBW endurance will last 1,000 / 20 = 50 days! That’s unacceptable, so you either need to reduce WA or use higher‑endurance drives (like Optane or enterprise SLC) which are prohibitively expensive.

This is why WA is an economic issue: it directly ties your storage hardware cost to your write throughput.


Read Path Optimization

While writes are the primary focus of LSM trees, reads must not be neglected. LSM trees are notorious for read amplification – the number of disk reads required to satisfy a query. Without optimizations, a point read might need to check all L0 files plus one file per level, resulting in dozens of random reads. Fortunately, there are several effective optimization techniques.

Bloom Filters: The Indispensable Lie

Bloom filters are a probabilistic data structure that can tell you “this key is definitely not in this SSTable” or “this key might be in this SSTable.” They have no false negatives and a configurable false positive rate (e.g., 1%).

In an LSM tree, each SSTable can have a Bloom filter. When reading a key, the system checks:

  1. Memtable (in‑memory, fast).
  2. Immutable memtable (in‑memory, fast).
  3. For each SSTable (starting from L0 newest to oldest):
    • Check the Bloom filter. If it says “no,” skip the file with zero I/O.
    • If “maybe,” read the index block (if not cached) to locate the data block, then read the data block and retrieve the key.

Bloom filters dramatically reduce disk reads. With a 1% false positive rate, you will unnecessarily read only about 1% of the files that don’t contain the key. This is often enough to make point reads require 1-2 disk I/Os, even deep into the LSM tree.

Block Cache and Filter Cache

RocksDB maintains two caches:

  • Block Cache (LRU): Caches decompressed data blocks from SSTables. Size configurable, typically 10-50% of total memory.
  • Filter/Index Cache: Caches Bloom filters and index blocks. These are usually much smaller than data blocks.

By caching frequently accessed blocks, reads become in‑memory hits, avoiding disk entirely.

Range Scans and Prefix Seeks

Range scans (e.g., Scan("user:1000", "user:2000")) are more expensive because they must traverse all SSTables for the entire range. LSM trees handle this by iterating over SSTables in key order. For each SSTable, you open an iterator and merge it with others. This is done by a merge iterator that reads from multiple SSTables in parallel and yields keys in sorted order.

RocksDB’s merge iterator is efficient: it uses a min‑heap over the current keys from each active SSTable. However, if you have many L0 files, the heap becomes large, and reading from many files can cause slow range scans.

Prefix Seek: Some workloads only query by prefix (e.g., “user:1000_” all keys with that prefix). RocksDB supports prefix Bloom filters and prefix iterators that can skip entire file sections not matching the prefix.

Read Amplification Calculation

For a point read (single key lookup), read amplification can be approximated as:

Read Amplification = (#L0 files) * (1 - false_positive_rate) + (#levels - 1) * (1 - false_positive_rate)

Assuming Bloom filters with 1% false positive, and 4 L0 files and 4 levels, that’s approximately 40.01 + 30.01 = 0.07 disk reads on average. That’s less than one! But worst case (all filters say “maybe”) gives 4 + 3 = 7 disk reads. So Bloom filters turn worst-case reads into a small probabilistic cost.

However, range scans are not helped by Bloom filters; they must read all relevant data blocks. So read amplification for range queries is proportional to the number of files and the size of the range.


Real-World Implementations and Trade-offs

Understanding the theory is one thing; seeing how the trade-offs play out in actual systems is another. Let’s examine three major implementations.

LevelDB: The Ur-LSM

LevelDB was created by Google’s Jeff Dean and Sanjay Ghemawat in 2011. It’s a standalone library (not a server) written in C++. It introduced many of the ideas we now take for granted: leveled compaction, Bloom filters, block cache, and the concept of a multi-level LSM tree.

Key characteristics:

  • Single writer (sequencer). Writes go through a single thread to avoid concurrent modification.
  • Block size default 4 KB.
  • Snappy compression.
  • Background thread for compaction.

LevelDB is remarkably simple and stable, but it has limitations: no concurrent reads during compactions? Actually reads are concurrent with compactions. The major limitation is the use of mmap for reading (on non‑Linux platforms) and limited configurability.

RocksDB: The Enterprise LSM

Facebook forked LevelDB in 2013 to create RocksDB, optimized for high‑performance storage on fast SSDs and memory. RocksDB can be found in MySQL (MyRocks), MongoDB (RocksDB storage engine), Cassandra (since 4.0), and countless other systems.

Key enhancements over LevelDB:

  • Concurrent memtable insertion: Multiple writers can add to memtable concurrently using WriteBatch with thread‑local buffers.
  • Write stalls: Configurable thresholds to slow or stop writes to prevent compaction from falling behind.
  • Column families: Logical partitions within a single database, each with its own memtable and SSTables.
  • Merge operators: Instead of simply overwriting values, you can define a merge operation (e.g., increment a counter) that is applied during compaction.
  • Tiered storage: Ability to designate different storage types for different levels (e.g., hot data on NVMe, cold data on HDD or S3).
  • Compaction filters: User‑defined functions that can inspect and discard keys during compaction (e.g., drop expired entries).
  • Multi-threaded compaction: Compactions can run in parallel (controlled by max_background_compactions).

Typical tuning for write-heavy:

  • Use leveled compaction with level_compaction_dynamic_level_bytes (target level sizes are computed dynamically based on total data size, reducing write amplification).
  • Increase write_buffer_size to 128 MB or 256 MB to reduce flush frequency.
  • Set min_write_buffer_number_to_merge = 2 to merge two memtables before flushing (reduces LO files).
  • Use compaction_options_fifo if data is short‑lived.

Apache Cassandra: Distributed LSM

Cassandra is a distributed key‑value store that uses a distributed hash ring plus local LSM trees on each node. It was designed from the ground up for write scaling.

Key aspects:

  • Partitioning: Data is partitioned by hash of the partition key. Each node stores a subset of partitions.
  • Local storage: Each node runs a local storage engine (originally pure LSM, now supports multiple compaction strategies).
  • Compaction Strategies: SizeTieredCompactionStrategy (default for years), LeveledCompactionStrategy, TimeWindowCompactionStrategy, etc.
  • Hinted Handoff: Writes are buffered if a replica is down; later replayed.

Cassandra’s write path is extremely fast because it only appends to a commit log and writes to memtable. The distributed nature adds complexity (repair, anti-entropy), but the LSM core remains.

Other Notable Implementations

  • WiredTiger (MongoDB’s storage engine): Uses B‑trees for some workloads and LSM trees for others.
  • ScyllaDB: C++ reimplementation of Cassandra with a shard-per-core architecture; uses its own LSM variant.
  • ClickHouse: MergeTree table engine is an LSM‑like structure for analytics.

Advanced Topics

Write Stalls and Backpressure

The number one operational headache with LSM trees is write stalls. When compaction cannot keep up with the rate of flushes, L0 file count grows. To prevent excessive read amplification, the system imposes backpressure:

  • Slowdown: Writes are delayed (e.g., 1 ms delay per write) to reduce the influx.
  • Stop: Writes are blocked entirely until compaction catches up.

Write stalls are latency spikes that can be catastrophic for applications requiring low tail latencies (e.g., 99.9th percentile < 10 ms). Mitigations:

  • Increase level0_slowdown_writes_trigger and level0_stop_writes_trigger (at the cost of higher read amplification during bursts).
  • Use dynamic level sizes to spread data more evenly.
  • Allocate more CPU and I/O to compaction (rate_limiter in RocksDB can throttle compaction I/O to avoid starving reads, but that may worsen stalls).
  • Use a larger memtable size to reduce flush frequency.

Monotonically Increasing Keys

If your keys are monotonically increasing (e.g., timestamps, auto‑increment IDs), LSM trees become exceptionally efficient. Why? Because flushes produce L0 files with disjoint key ranges (since new keys are always greater than old ones). Compaction becomes a simple append rather than a merge with overlapping files. Read amplification also drops because newer SSTables at higher levels are immediately placed at the end.

This is why time‑series databases (InfluxDB, TimescaleDB) use LSM trees heavily. Write amplification can drop to near 1× if you use FIFO compaction or specialized strategies.

Merge Operators

In an LSM tree, every update creates a new version. If you want to increment a counter without reading the old value, you’d normally need to issue a read-modify-write. But with merge operators, the client sends a “merge” command (e.g., “add 5 to counter 0x123”). During compaction, the system applies the merge operator to the base value and accumulates multiple merge operations into one. This dramatically reduces write amplification for update‑in‑place workloads.

RocksDB includes several built‑in merge operators (e.g., for string concatenation, integer addition). You can also define custom ones.

Tiered Storage (Hot/Cold)

In a classic LSM, all levels sit on the same storage. However, because higher levels (L3, L4, …) hold older, less frequently accessed data, you can place them on slower, cheaper storage (e.g., HDD or S3) while keeping lower levels on fast NVMe. This is called tiered storage or multitier compaction.

RocksDB supports blob_storage for large values, and the newer Pyramid compaction concept allows moving files between tiers. Cassandra has LeveledCompactionStrategy with the option to configure different directories per level.

The challenge: compaction between different tiers may involve cross‑network I/O, causing high latency. But for many workloads, the cost savings (in $/GB) are enormous.


Tuning for Your Workload

There is no universal “best” configuration for an LSM tree. The optimal settings depend on whether your workload is:

  • Write‑heavy vs. read‑heavy
  • Point reads vs. range scans
  • Small values vs. large values
  • Uniform keys vs. monotonically increasing
  • Transient data vs. permanent

A practical tuning checklist:

  1. Memory budget: Allocate a portion of RAM to block cache (for reads) and a portion to memtable(s) (for writes). A typical ratio is 30% memtable / 70% block cache for write‑heavy systems, or 10% / 90% for read‑heavy.
  2. Compaction strategy: Use leveled for read‑heavy workloads where you want low space amplification and fast point reads. Use universal (size‑tiered) for write‑heavy workloads where you can tolerate more space and slower reads. FIFO for ephemeral data.
  3. Level sizes: For leveled, use dynamic_level_bytes to adapt to actual data size. Set max_bytes_for_level_multiplier based on your WA tolerance: smaller multiplier (e.g., 5) reduces WA but increases number of levels (slower reads).
  4. Bloom filters: Set bloom_locality = 1 for a cache‑friendly filter. False positive rate: 1% is a good default; for write‑heavy you can increase to 2% to save memory.
  5. Block size: Larger blocks (16 KB vs 4 KB) improve sequential read bandwidth but increase read amplification for small point queries. For point reads, use 4 KB; for range scans, 16 KB+.
  6. Compression: Use compression that balances CPU and space. For write‑heavy, avoid heavy compression (LZ4 instead of Zstd). For read‑heavy with large values, Zstd gives better compression at higher CPU cost.
  7. Parallelism: Increase max_background_compactions and max_background_flushes to utilize multiple CPU cores and disk queues. But be careful: too many compactions can saturate I/O, starving reads.
  8. Rate limiter: In RocksDB, you can set a RateLimiter to cap compaction I/O, preventing it from hogging bandwidth. This ensures predictable foreground latency.

Example Configuration for Write‑Heavy Time‑Series

Options options;
options.create_if_missing = true;
options.write_buffer_size = 256 * (1ULL << 20); // 256 MB
options.max_write_buffer_number = 4;
options.min_write_buffer_number_to_merge = 2;
options.compression = kLZ4Compression;
options.bottommost_compression = kSnappyCompression;
options.level_compaction_dynamic_level_bytes = true;
options.max_bytes_for_level_base = 512 * (1ULL << 20);
options.max_bytes_for_level_multiplier = 8;
options.level0_file_num_compaction_trigger = 8;
options.level0_slowdown_writes_trigger = 16;
options.level0_stop_writes_trigger = 24;
options.max_background_compactions = 4;
options.max_background_flushes = 2;
options.table_factory.reset(NewBlockBasedTableFactory(
    BlockBasedTableOptions()
        .block_cache = NewLRUCache(1 * (1ULL << 30)) // 1 GB cache
        .filter_policy.reset(NewBloomFilterPolicy(0.01, false))
        .block_size = 4096
));

This configuration gives high write throughput (large memtable, aggressive flush threshold, multiple background threads) while maintaining reasonable read performance (dynamic levels, Bloom filters, block cache).


Conclusion: The Hidden War Continues

We began with a librarian struggling to keep up with a torrent of updates. The LSM tree is the librarian’s ultimate weapon: it accepts writes in a frenzy, sorts them later, and emerges with calm, sorted order. But behind that serenity, the hidden war of write amplification rages on. Every byte you push into the system is multiplied by forces you cannot see—compaction, merging, rewriting—and those forces exact a toll on your hardware, your latency, and your budget.

The key insight is that there is no free lunch. The LSM tree’s strategy of batching writes and sequentially merging them is brilliant for write throughput, but it comes at the cost of background machinery that can stall your writes, wear out your SSDs, and confuse your capacity planning. To be a master of your database’s fate, you must understand write amplification, tune compaction, and choose strategies that align with your workload’s economics.

We have covered:

  • Why traditional B‑trees and hash tables fail for write‑heavy workloads.
  • How LSM trees convert random writes into sequential ones.
  • The definition and measurement of write amplification and its brutal economics.
  • The mechanics of compaction—size‑tiered, leveled, and hybrid strategies.
  • Optimizations for the read path: Bloom filters, block caches, and merge iterators.
  • Real‑world examples from LevelDB, RocksDB, and Cassandra.
  • Advanced topics: write stalls, merge operators, monotonically increasing keys, and tiered storage.
  • Practical tuning guidelines for common workloads.

The next time you see a database performance problem—latency spikes, SSD failures, unexplained high I/O—remember the hidden war. Look at your compaction logs, measure your write amplification, and ask yourself: What would the librarian do? With the knowledge from this deep dive, you can now answer that question.

And the war? It will never end. As hardware evolves (persistent memory like Intel Optane, Zoned Namespace SSDs, computational storage) and as workloads grow (AI training pipelines, real‑time fraud detection), the storage engines will continue to adapt. The principles of LSM—sequential writes, asynchronous merging, and the trade‑off between write and read amplification—will remain fundamental. They are the bedrock upon which the modern data stack is built.

Now go forth and optimize. The librarian is counting on you.