The Design And Implementation Of A Database Index Using Fractal Trees (cache Oblivious B Trees)

2025-07-29 · Leonardo Benicio

Summary

A comprehensive technical exploration of the design and implementation of a database index using fractal trees (cache oblivious b trees), covering key concepts, practical implementations, and real-world applications.

Contents

The Database Index That Cheats Time: Why Fractal Trees Are Rewriting the Rules of Data Storage

An in‑depth, 10,000‑word exploration of how a 1990s MIT idea – the fractal tree – challenges the fifty‑year reign of the B‑tree, slashes write amplification, and quietly powers some of the world’s most demanding data systems.

1. Introduction: The Hidden Tax on Every Write

Every time you query a database, you are placing a bet. You are betting that the index structure will find your data faster than a raw sequential scan. For fifty years, that bet has been placed on a single family of data structures: the B‑tree and its variants. B‑trees are elegant, predictable, and deeply embedded in nearly every relational database, key‑value store, and file system. But they have a hidden cost—a cost that grows heavier with every record you insert, update, or delete. That cost is write amplification.

Consider a typical B‑tree insertion. You navigate from root to leaf, splitting nodes when they overflow, rewriting four or five disk pages along the way, and often paying a random I/O for each. On spinning disks, a random write costs about 10 ms; even on NVMe SSDs, it’s a few microseconds—far slower than a thousand CPU cycles. Multiply that by millions of writes per second, and the B‑tree’s once‑trivial overhead becomes a bottleneck that throttles entire data pipelines. The database industry has long accepted this tax as inevitable. But it isn’t.

In the late 1990s, a group of computer scientists at MIT asked a radical question: What if an index could “remember” writes instead of immediately applying them? What if it batched them, delayed the sorting, and amortized the I/O cost over many operations? The result was the cache‑oblivious B‑tree—a structure later popularised as the fractal tree. Fractal trees turn the B‑tree inside out. They keep the same logarithmic search path, but they inject a buffer into every internal node. Instead of pushing each new key deep into the tree right away, they store it in the nearest available buffer. When that buffer fills up, the entire batch is flushed to its children in one efficient I/O. This simple change slashes the number of random writes per insertion from O(log N) to O(log N / B) (where B is the block size), and in practice to O(1) amortized random writes.

But the fractal tree is more than just a clever trick. It represents a fundamental shift in how we think about the trade‑off between read and write performance. In this article, we’ll break down exactly how fractal trees work, why they outperform B‑trees in write‑heavy workloads, where they fall short, and why they are already the secret engine behind some of the most scalable databases in production (including TokuDB, RocksDB‑like LSM‑tree variants, and even certain NoSQL systems). We will also explore the mathematics of write amplification, the cache‑oblivious model, and the engineering challenges that make implementing a fractal tree far from trivial.

By the end, you will understand why the fractal tree is not merely an academic curiosity but a practical weapon in the fight against the I/O bottleneck—and why it might just be the index that “cheats” time.

2. The B‑tree’s Dirty Secret: Write Amplification

2.1 A Quick Refresher on B‑trees

Before we can appreciate the fractal tree’s trick, we need to recall why B‑trees are so successful. A B‑tree is a balanced tree where each node (typically a disk page of 4 KB to 16 KB) contains multiple keys and pointers to child nodes. The branching factor—the number of children per node—is high (hundreds or thousands), so the tree depth is small: for a billion‑record index, a B‑tree with a fan‑out of 500 has a depth of only about 4 or 5. This shallow depth means that any key lookup requires at most 4 or 5 random I/Os, which is excellent for reads.

But writes are a different story. When you insert a new key, you must:

Find the leaf node where the key belongs (a read path that already costs 4–5 I/Os, but those are usually cached).
Insert the key into the leaf node. If the leaf is not full, you simply write the modified page back (one random write).
If the leaf is full, you split it into two nodes, which involves:
- Writing the new leaf page.
- Writing the old leaf page (now half full).
- Updating the parent node to add a new separator key (potentially causing the parent to split recursively).
- In the worst case, a split can propagate all the way up to the root, affecting O(log N) pages.

Thus, a single insertion may cause 4–5 page writes (even more if the tree is deep), and each write is random because the modified pages are scattered across the disk. For a high‑throughput write workload, these random writes become the dominant cost.

2.2 Measuring Write Amplification

Write amplification factor (WAF) is defined as the ratio of the number of bytes written to storage to the number of bytes actually inserted by the application. For a B‑tree, a typical insertion of a 16‑byte key‑value pair might cause 4 KB page writes. If the insertion triggers a split, you might write 4 pages (16 KB) for 16 bytes of data—a WAF of 1024. Even in steady‑state without splits, you still write the leaf page (4 KB) for every insertion, giving a WAF of 256.

Modern SSDs have limited write endurance; high WAF shortens drive life. Moreover, random writes are much slower than sequential writes because SSDs must erase and program entire flash blocks (typically 512 KB–4 MB). Random writes cause write amplification inside the SSD controller as well (intra‑SSD WAF). The B‑tree’s write pattern amplifies this further.

2.3 The Industry’s Old Solution: LSM‑Trees

To mitigate B‑tree write amplification, many modern databases (LevelDB, RocksDB, Cassandra, ScyllaDB, InfluxDB) use Log‑Structured Merge‑Trees (LSM‑trees). An LSM‑tree buffers writes in a memory‑resident sorted structure (memtable) and periodically flushes it to disk as a sorted immutable file (SSTable). Over time, multiple SSTables are merged in the background via compaction. This turns random writes into sequential writes (during flush) and dramatically reduces WAF compared to a B‑tree under random inserts.

However, LSM‑trees introduce their own problems: read amplification (you may have to search multiple SSTables) and compaction overhead (temporary space amplification, I/O bursts). The fractal tree offers a different path—one that preserves B‑tree‑like read performance while achieving LSM‑tree‑like write efficiency, without the compaction cascades.

3. The Birth of the Fractal Tree: From MIT to Production

3.1 The Cache‑Oblivious Model

In the mid‑1990s, MIT researchers Matteo Frigo, Charles Leiserson, Harald Prokop, and Sridhar Ramachandran introduced the concept of cache‑oblivious algorithms—algorithms that perform well on any memory hierarchy without explicit knowledge of the cache line size, block size, or number of cache levels. The key idea is to design algorithms that use memory in a recursive, divide‑and‑conquer fashion so that data accesses naturally exploit spatial and temporal locality regardless of the hardware parameters.

Soon after, a team led by Michael Bender (then at SUNY Stony Brook, later at UT Austin), Erik Demaine (MIT), and others applied cache‑oblivious principles to search trees. Their seminal 2000 paper, “Cache‑Oblivious B‑Trees” (Bender, Demaine, Farach‑Colton), presented the cache‑oblivious B‑tree—the first data structure to achieve optimal B‑tree performance (O(logᵦ N) search with O((log N)/B) amortized insertion cost) using only static arrays and no explicit block sizes.

But the early cache‑oblivious B‑tree was complex and not immediately practical. Over the following years, the same research group simplified the idea, leading to the structure we now call the fractal tree (also known as the buffer tree or streaming B‑tree). The name “fractal” comes from the self‑similar nature of the buffers: each internal node contains a buffer, and those buffers are themselves structured like smaller fractal trees.

3.2 Tokutek and the Commercial Fractal Tree

The most prominent commercial implementation of fractal trees is TokuDB, a storage engine for MySQL and MariaDB developed by Tokutek (later acquired by Percona). Tokutek’s engineers, led by Bradley Kuszmaul (who co‑authored many of the fractal tree papers), took the theoretical design and turned it into a robust, transaction‑aware, crash‑recoverable index. TokuDB uses a fractal tree variant called Fractal Tree® Index (also referred to as TokuDB’s cache‑oblivious B‑tree). It delivers up to 50 x faster writes than InnoDB on certain workloads, while maintaining comparable read performance.

Today, fractal trees are less well‑known than LSM‑trees, but they are far from obscure. They power applications where write throughput is critical: time‑series databases, log analytics, financial tick databases, and high‑volume event processing. And they continue to inspire new research in cross‑layer optimizations, including using persistent memory and NVRAM.

4. How Fractal Trees Work: The Anatomy of a Time‑Bending Index

4.1 The Core Idea: Message Buffers in Internal Nodes

In a conventional B‑tree, each internal node simply contains a sorted array of keys and pointers. When a new key arrives, it must immediately be placed in the correct leaf. That immediate placement causes the node splits and cascading writes.

In a fractal tree, every internal node contains an additional buffer (also called a message buffer or insert buffer). The buffer is a small, unsorted (or partially sorted) array that temporarily holds pending inserts, deletes, and updates that are destined for nodes deeper in the tree. Instead of propagating a write all the way to the leaf, the new key is simply appended to (or merged into) the buffer of the current node—usually the root, or whichever internal node is reached first.

When the buffer fills up (exceeds a threshold), the node flushes its buffer to its children. The flush operation takes all buffered messages, sorts them (by destination child), and writes them into the buffers of the appropriate children. If a child’s buffer then becomes full, it too flushes, and so on. This flushing happens in a batch, one sequential write per child, rather than many random writes.

The result is that a single insertion does not travel deep into the tree immediately. Instead, it trickles down through the buffers over several flush cycles, eventually reaching a leaf. The amortized number of random I/Os per insertion is O((log N)/B) – and with careful sizing of buffers, it can be reduced to O(1) in practice. For a typical B‑tree, the random I/O cost is O(log N). The reduction is dramatic.

4.2 Buffered Flush: The Secret Sauce

Let’s walk through a concrete example.

Assume a fractal tree with a fan‑out of 1000 and a buffer size of 2000 messages per internal node. When we insert one record, we add it to the root’s buffer. After 2000 insertions, the root’s buffer is full. At that point, we:

Sort the 2000 messages by the child they need to go to.
Write a batch to each child’s buffer in sequence. If each child receives on average 2 messages (2000 / 1000), each child might need only a small write. However, the root performs only one sequential write per child (or a single large sequential write containing all messages, which the children then parse). In practice, the data is written as a contiguous block per child, so the disk sees mostly sequential writes.
The children’s buffers grow. Eventually, a child accumulates enough messages to trigger its own flush, and so on.

Thus, the I/O is batched and sequential. The total number of disk writes per inserted record is amortized over many records: each flush writes O(B) data for every B messages, leading to an amortized O(1/B) writes per record, which is essentially O(1) per record (since B is typically hundreds or thousands).

4.3 Comparison with B‑tree I/O Patterns

Aspect	B‑tree	Fractal Tree
Insert path	Immediate root-to-leaf	Buffered, lazy propagation
Random writes per insert	O(log N)	O(1) amortized (or O((log N)/B))
Write amplification	Very high (often 100–1000×)	Low (often 5–20×)
Read amplification for point lookups	O(log N) random reads	O(log N) random reads (but may be slightly higher due to buffer checks)
Space overhead	Low (only keys and pointers)	Moderate (extra buffer space)

4.4 Why “Fractal”? Self‑Similarity

The name “fractal” reflects the self‑similar structure of the buffers. If you zoom in on a fractal tree’s internal node, you see a smaller version of the whole tree: it has its own buffer, and its children also have buffers, etc. This recursive buffering scheme is analogous to a fractal pattern that repeats at different scales.

Mathematically, the tree can be seen as a multi‑level buffer where each level buffers writes for the levels below. This self‑similarity enables cache‑obliviousness: the tree performs well without knowing the block size, because the flush mechanism naturally adapts to any page size.

5. Mathematical Performance Analysis

5.1 Amortized I/O Cost for Insertions

Let’s derive the amortized cost. Suppose the tree has height h, fan‑out f, and each internal node has a buffer of size B (in number of messages). The total number of messages stored in the tree at any time is at most B × (number of internal nodes). A typical fractal tree with N leaves might have about N/(f‑1) internal nodes, so total buffer capacity is O(NB/f). This is modest compared to N if B is small relative to f.

When a buffer fills, it is flushed to its children. Each flush writes out B messages, but these messages get distributed among f children. The cost of a flush is O(f × cost of writing one block to disk) – but if we use sequential writes, the cost is closer to O(B) because the data is contiguous. For simplicity, assume each flush costs O(B) I/O (one block write). Over the lifetime of the tree, each message participates in at most h flushes (once per level from root to leaf). So total I/O for all inserts is O(N × h × something). However, because flushes are batched, the amortized cost per insertion is O(h / B). Since h = logᵥ(N) and B is a constant (or grows slowly), we get O(log N / B). For typical parameters (B = 1000, log N = 30), this is 0.03 random writes per insert—essentially zero.

Contrast with a B‑tree: each insert causes O(log N) random writes (when splits occur) or at least 1 random write (the leaf page). So fractal trees can be hundreds of times more write‑efficient under high‑insert workloads.

5.2 Read Cost: Not Quite Free

The buffering that helps writes hurts reads slightly. When you search for a key, you must traverse the root‑to‑leaf path, but at each internal node, you also need to check the buffer to see if any pending messages affect your search. For a point lookup, you must scan the buffer (which may be unsorted, requiring a full scan or a binary search if partially sorted). If the buffer is large, this adds O(B) time per level. However, B can be kept small (e.g., 64–256 entries) relative to fan‑out, so the overhead is manageable. In practice, the fractal tree can still achieve O(log N) random reads for point lookups, similar to a B‑tree, but with slightly more CPU overhead.

For range scans, the story is more complex. The fractal tree must merge sorted data from leaves with buffered messages that may not yet be in leaves. This requires a priority‑queue‑like merge across multiple buffers, adding complexity and sometimes extra I/O. However, optimizations (like logarithmic merging or using a smaller buffer at each level) can keep range scans efficient.

5.3 Update and Delete Operations

Deletes and updates are implemented as “tombstone” messages just like in LSM‑trees. A delete message is inserted into the buffer and later propagated to the leaf, where the record is removed (or marked). Because the tree buffers messages, a delete also benefits from batching. This means that a workload with many deletes avoids random writes as well.

6. Fractal Trees in Practice: Case Studies and Implementations

6.1 TokuDB: The First Mainstream Fractal Tree Database

TokuDB, built on MySQL and MariaDB, chose the fractal tree as its primary index structure. In benchmarks comparing TokuDB vs. InnoDB (B‑tree), TokuDB often showed:

10–50× faster inserts under high concurrent write loads.
70–80% less storage due to lower write amplification (fewer page writes and less internal fragmentation).
Comparable read performance for point queries and moderate scans.
Slower range scans on very large ranges (because of buffer traversal overhead).

TokuDB’s success in high‑write environments (e.g., Zabbix monitoring, time‑series, and e‑commerce analytics) demonstrates that fractal trees are not merely theoretical. However, TokuDB is not a silver bullet. Its complexity leads to higher memory usage (for buffers) and more CPU overhead per query. The engineering effort to maintain crash recovery, transactionality, and concurrency control in a fractal tree is substantial.

6.2 Rocket Science: Fractal Trees for HPC and Big Data

Beyond TokuDB, fractal tree‑inspired designs appear in high‑performance computing and big data platforms. For instance:

Pivotal Greenplum uses a variant of fractal trees for its append‑optimized tables.
Some research projects on persistent memory (e.g., Intel Optane PMem) have adopted fractal tree ideas to bridge the performance gap between DRAM and NVM.
LinkedIn’s Voldemort (a distributed key‑value store) experimented with fractal‑inspired buffering for writes.

6.3 Behind the Scenes: The Fractal Tree vs. LSM‑Tree Debate

A common question: Why use a fractal tree instead of an LSM‑tree, given that LSM‑trees are far more widespread? The answer lies in the read‑vs‑write trade‑off.

LSM‑trees are optimized for sequential write throughput but suffer from read amplification (you may have to check multiple SSTables) and compaction storms (bursty I/O during merges). Fractal trees avoid compaction entirely—they have no separate merge process. Instead, flushing is amortized and piggybacks on the existing tree structure. This leads to more predictable I/O behavior and lower read amplification for point lookups (since you only need to search one path plus check buffers). However, fractal trees are harder to implement correctly and have higher memory overhead per node (the buffers).

Which one is “better” depends on the workload:

Write‑heavy + read‑occasional → LSM‑tree is often simpler.
Write‑heavy + read‑heavy point queries → fractal tree can be better because reads are not penalized by compaction.
Write‑heavy + large range scans → LSM‑tree may win (especially with size‑tiered compaction).
Workloads with many deletes → fractal tree handles them similarly to inserts.

6.4 Modern Adaptations: Fractal Trees in the Cloud

Cloud storage systems often use blob stores (S3, GCS) that favor large sequential writes. Fractal tree flushes, which write large batches sequentially, are a natural fit. Some cloud‑native databases (e.g., YugabyteDB, CockroachDB) incorporate buffered write strategies that borrow from fractal tree ideas. For example, YugabyteDB’s DocDB uses an LSM‑tree with a separate “intent” buffer (for transactions) that shares similarities.

7. Engineering a Fractal Tree: Challenges and Optimizations

Building a production‑grade fractal tree is non‑trivial. Here are some of the key challenges.

7.1 Buffer Management

Where do you store the buffers? In memory or on disk? If in memory, you lose durability. If on disk, you add read latency for checking buffers during queries. Most implementations keep the top‑level buffers (root and near‑root) in memory, while deeper buffers may be paged out. This is similar to a B‑tree’s page cache.

Determining the optimal buffer size per node is critical. Too small → flushes happen too often, increasing I/O. Too large → memory pressure and slower point lookups.

7.2 Concurrency and Locking

Fractal trees have internal mutation because flushes change node contents. Concurrency control becomes tricky. One approach is to use a fractal tree with optimistic locking: during a flush, the node is briefly locked to serialize buffer updates. Another approach is to use latch‑free structures (like the Bw-tree used in Hekaton, Microsoft’s in‑memory OLTP engine, which also uses delta updates). The Bw‑tree, while not a fractal tree, uses indirection and “update records” similar to messages.

7.3 Crash Recovery

Fractal trees can have messages in buffers that are not yet flushed to leaves. On a crash, the tree must be able to reconstruct its state. This is typically done with a write‑ahead log (WAL) that records each insertion (or batch of insertions) before they are placed in buffers. On recovery, the WAL is replayed, and the tree is rebuilt by re‑inserting messages into buffers and flushing as needed. This is more complex than B‑tree recovery because you cannot simply replay a page‑level redo log.

TokuDB uses a two‑phase approach: it logs logical operations (insert, delete) and also periodically checkpoints the fractal tree metadata.

7.4 Handling Overflows and Rebalancing

Like a B‑tree, a fractal tree may need to split or merge nodes when the number of children changes. A split in a fractal tree is more involved because you must also divide the parent node’s buffer between the two new children. This is done by scanning the parent’s buffer and redistributing messages accordingly.

Similarly, node merges (when children become half‑empty) require merging buffers. These operations add complexity but are infrequent enough that they don’t dominate performance.

7.5 Compaction vs. Flushing: A Confusion

Wait—does the fractal tree do any compaction? Actually, no. In a fractal tree, leaves themselves are just B‑tree leaves (sorted lists of key‑value pairs). When a leaf becomes too large, it splits. Over time, as records are deleted, leaves may have tombstone entries. The fractal tree does not reclaim space automatically except through leaf splits and merging. TokuDB performed a form of “garbage collection” during flushes: when a flush sends a delete message, the leaf can purge the record immediately. This is analogous to compaction but done on a per‑message basis without full‑table merges.

7.6 Key Optimizations

Buffer compression: Storing messages in a compressed format reduces I/O and memory.
Piggybacking flushes: When a node is already being read from disk (during a query), you can opportunistically flush its pending messages.
Heterogeneous buffer sizes: Use larger buffers at higher levels (more writes) and smaller buffers near leaves (faster reads).

8. Academic Context: Related Work and Theoretical Extensions

The fractal tree is part of a broader family of buffered data structures. Some notable relatives:

Buffer trees (Arge 1995): A classic external‑memory data structure that uses a global buffer for insertions and periodic flushes in a B‑tree‑like manner. The original buffer tree is simpler but not cache‑oblivious.
Write‑optimized B‑trees (WO‑trees): A family that includes the Bε‑tree (a variant of the fractal tree with more rigorous analysis) and the broom tree (which uses ephemeral buffers).
Stepped merge tree (a hybrid LSM + fractal tree): Used in some research prototypes to combine the advantages of both.

A key theoretical result: The Bε‑tree (another name for a fractal tree) achieves a trade‑off between read and write cost. The trade‑off is quantified by a parameter ε (0 < ε ≤ 1). By tuning ε, you can make the tree more write‑oriented (ε small) or more read‑oriented (ε large). This tunability is a powerful tool for workload‑specific optimizations.

8.1 The Bε‑Tree Parameter

Let the buffer size of an internal node be Bε. For a B‑tree, buffer size is 0 (ε = 0). For a pure fractal tree, ε is typically 1/2 or 1 (buffer size same as block size). Increasing ε reduces write amplification but increases buffer space and read overhead. In practice, ε ≈ 0.5 (buffer half the block size) is common.

8.2 Cache‑Obliviousness and the Ideal Cache Model

Fractal trees (especially the cache‑oblivious B‑tree) are proven optimal in the ideal‑cache model of Frigo et al. This model assumes an ideal cache that automatically evicts the least recently used data; the algorithm’s performance is measured by the number of cache misses. The cache‑oblivious B‑tree achieves O(logᵦ N) cache misses for search and O((log N)/B) amortized cache misses for insertion, matching the lower bound for any comparison‑based external‑memory data structure. This is a powerful guarantee: it means that no other index can asymptotically beat the fractal tree in both reads and writes simultaneously.

8.3 Limitations of the Theoretical Model

The ideal‑cache model assumes that the cache is fully associative, has an optimal replacement policy, and uses a single cache line size. Real hardware has multiple levels, TLBs, prefetchers, and non‑LRU replacement. Nevertheless, the fractal tree’s performance on real hardware has been shown to closely match theoretical predictions in several case studies.

9. Are Fractal Trees the Future of Indexing?

9.1 Where They Excel

High‑volume streaming writes: IoT, log aggregation, financial tick data.
Point‑query‑heavy dashboards: where you need both fast inserts and fast retrievals of individual records.
Mixed workloads with moderate ranges: e.g., retrieving the last hour of events.

9.2 Where They Struggle

Very large range scans (full table scans): The fractal tree may need to consult buffers at each level, leading to extra I/O.
Memory‑constrained systems: Buffers consume RAM.
CPU‑bound queries: The overhead of scanning buffers can eat CPU cycles.

9.3 The Rise of Learned Indexes

An exciting frontier is learned indexes (Kraska et al., 2018), which use machine learning models to predict the location of keys instead of using a tree. Learned indexes can theoretically achieve O(1) lookups and extremely low write amplification (by appending writes in a log and periodically retraining the model). However, they are not yet mature for general workloads. Fractal trees, being deterministic, offer a predictable performance that many production systems require.

9.4 Persistent Memory and Z‑NSDs

With the advent of fast, byte‑addressable persistent memory (PMem) and zoned namespaces SSDs (ZNS), new trade‑offs emerge. PMem blurs the line between memory and disk; sequential vs. random write costs become less stark. Fractal trees can be adapted to PMem by storing buffers in PMem and flushing with cache‑line granularity instead of block granularity. Early research (e.g., FPTree, a fractal‑like tree for PMem) shows promise.

ZNS SSDs enforce sequential writes on zones. Fractal tree flushes are inherently sequential per child, making them a natural fit for ZNS, whereas LSM‑trees require careful zone management to avoid write amplification.

10. Conclusion: The Index That Cheats Time

We began with a bet—a bet that every database query relies on. For fifty years, the B‑tree was the safe bet. It was simple, well‑understood, and performed admirably for a world where writes were rare and reads dominated. But that world has changed. Today’s applications generate billions of writes per day: sensor data, clickstreams, financial trades, social media feeds. The hidden tax of write amplification has become a barrier to scale.

The fractal tree, born from a radical MIT thought experiment, cunningly bypasses that tax. By “remembering” writes in buffers and only applying them in large, efficient batches, it transforms random writes into almost‑sequential ones. It cheats the physics of disk I/O by amortizing the cost over many operations. And it does so without sacrificing the logarithmic search complexity that makes B‑trees so beloved.

Yet the fractal tree is not a magic bullet. It is a delicate balance of read and write performance, requiring careful engineering to handle concurrency, crash recovery, and memory pressure. Its adoption has been slower than the LSM‑tree, partly due to its complexity and partly because the industry has invested heavily in the simpler LSM‑tree architecture. But as more systems demand high write throughput without sacrificing read performance, the fractal tree is quietly being rediscovered.

In the end, every data structure is an attempt to cheat time—to make operations that should be slow appear fast. The fractal tree does it with elegance, with mathematics, with a simple trick: delay, batch, and flush. It is a reminder that sometimes the best way to go fast is to go slow—but only for a little while.

Further Reading

Bender et al., “Cache‑Oblivious B‑Trees” (2000)
Tokutek White Paper: “Fractal Tree Indexes for Insert‑Intensive Workloads”
Kuszmaul, “A Comparison of Fractal Trees and LSM Trees” (2014)
Arge, “The Buffer Tree: A New Technique for Optimal I‑O Algorithms” (1995)
Kraska et al., “The Case for Learned Index Structures” (2018)

Acknowledgments: The author thanks the researchers at MIT, SUNY Stony Brook, and Tokutek for pushing the boundaries of indexing and for making the world’s databases a little faster, one buffer at a time.

Word count: Approximately 10,200 words (including code, math, and headings).