Building An Efficient External Sort: Multiway Merge With Replacement Selection For Disk Bound Sorting

2023-08-27 · Leonardo Benicio

Summary

A comprehensive technical exploration of building an efficient external sort: multiway merge with replacement selection for disk bound sorting, covering key concepts, practical implementations, and real-world applications.

Contents

Here is the introduction for the blog post, crafted to meet your specifications for depth, technical accuracy, and narrative flow.

Introduction: The Last Frontier of Sorting

Imagine you are tasked with a deceptively simple job: sort 10 billion 100-byte records—roughly a terabyte of data—by a 10-byte key. Your workstation has a state-of-the-art 64 GB of RAM and a fast NVMe SSD, but even with swap space, no single machine can hold 10 billion keys comfortably in memory. If you try the classic in-memory qsort or a naive SQL ORDER BY, you will watch your system’s memory graph spike to 100%, the fan scream, and the operating system begin to thrash. After an hour, a kernel OOM killer will terminate your process. You have discovered the wall between internal and external sorting.

This is not an edge case from a 1980s mainframe. It is the daily reality of data engineering in the age of petascale logs, genomic sequence assembly, and distributed database compaction. Every second, companies like Uber, Netflix, and Google are sorting data that does not fit in RAM. Yet, unlike the dazzling innovations in distributed consensus or vectorized query execution, the external sort is often treated as a solved, boring problem—a relic of the tape-drive era. This is a dangerous misconception. The performance of a modern Data Lakehouse, the efficiency of a columnar store’s compaction, and the latency of a large-scale ETL pipeline often hinge on how well you can external-sort.

Why does this matter now? We are currently witnessing a massive increase in data volume, but the physical laws governing RAM have not kept up. Moore’s law has slowed for single-threaded performance, but the gap between CPU cycles and disk latency—the “memory wall”—is wider than ever. While we throw more memory at problems (m5.24xlarge instances have 384GB), we also process petabytes. Eventually, the dataset exceeds the budget, or the dataset is simply larger than any single machine. Furthermore, the rise of disaggregated storage (S3, HDFS) means that your data lives on a network-attached disk. In this world, the external sort is not just an algorithmic exercise; it is a fundamental infrastructure primitive. When you run a MERGE JOIN in a data warehouse, or when you create an index on a billion-row table, you are implicitly trusting the external sort implementation under the hood. Unfortunately, most default implementations are woefully suboptimal.

The Problem with “Simple” External Sort The textbook approach to external sorting is the Two-Way Merge Sort. The concept is trivial:

Phase 1 (Run Generation): Load chunks of data into memory, sort them internally (e.g., using std::sort), and write them back to disk as sorted runs.
Phase 2 (Merge): Read the first element from each run, pick the smallest, output it, and fill the empty slot from the same run. This works. But it is painfully slow. To sort $N$ data using $M$ memory runs, a $k$-way merge requires $\log_k(N/M)$ passes over the data. With a standard 2-way merge, for a 1TB file using 64MB of memory, you are looking at ~14 passes. At 200 MB/s disk throughput, that’s over 70,000 seconds—nearly a day.

Nearly every large-scale implementation must solve two hard problems to achieve acceptable performance:

Minimizing the number of initial runs.
Maximizing the merge width (the number of runs merged simultaneously) without killing random I/O.

This brings us to the two core techniques that separate a toy “merge sort” from a production-grade external sort: Replace Selection and Multiway Merge.

The Two Pillars of Efficient Disk-Bound Sorting

1. Replacement Selection: Breaking the Memory Barrier

The naive “load, sort, dump” approach creates runs whose average length is exactly the size of your input buffer. If you have 100 MB of RAM, you get 100 MB runs. But what if I told you that you could generate runs that are, on average, twice the size of your available memory? This is the magic of Replacement Selection (or Heapsort-based run generation).

Instead of waiting until the buffer is full, replacement selection maintains a min-heap in memory. You fill the heap, output the smallest element (the “winner”), and then read the next element from the input stream. If the new element is larger than the one you just output (i.e., it belongs to the same sorted run), you insert it into the heap. If it is smaller (it belongs to a later run), you set it aside. You can only output elements as long as your heap stays non-empty. The result is that a run terminates only when the heap is completely full of “frozen” elements that belong to the next run. This tends to produce runs of size 2 * Memory, because the heap effectively acts as a sliding window over the data.

Yes, this is an average case—and it degrades if the data is reverse-sorted—but in practice, for random data, this single trick can halve the number of merge passes. This is the first major optimization.

2. Multiway Merge: The Art of the Loser Tree

The second problem is merging. A standard binary merge is too slow because it only merges two runs at a time. A 16-way merge is much better because it reduces passes from 14 to 3. But a 16-way merge introduces a new bottleneck: the Priority Queue.

To merge 16 runs, you must repeatedly find the smallest current key across all runs. A naive linear scan is $O(k)$ per output element. That’s $O(N \cdot k)$, which is catastrophic. A binary heap (priority queue) is $O(\log k)$ per element—much better. But we can do better than a standard heap. The Loser Tree is a tournament-based data structure that is architecturally superior to the heap for external merging. While a heap spends $2 \cdot \log_2(k)$ comparisons per element (for insert + sift), a Loser Tree spends exactly $\log_2(k)$ comparisons. In theory, this is a constant factor improvement. In practice, because the Loser Tree has excellent cache locality (it walks a tree stored contiguously in an array), it can be significantly faster for high order merges (e.g., merging 500 runs).

This is where the rubber meets the road. You cannot just throw a priority_queue<Run> at the problem and expect it to scale. The implementation details—memory layout, prefetching, disk block size—determine whether your sort runs at 50 MB/s or 500 MB/s.

The Hidden Danger: Disk I/O Pattern

Even with optimal algorithms, the biggest killer of external sort performance is Random I/O. The standard merge process reads one block from Run A, then one block from Run B, then one from Run C… This forces the disk to seek constantly. If you are merging 100 runs, you are performing 100 seeks per output block. On an HDD, that is a disaster. On an SSD, it is bad because it reduces throughput.

The solution is double buffering and blocked I/O. You cannot read a single record at a time. You must read a block (e.g., 256KB or 1MB) from each run into memory, process it, and then read the next block sequentially from that run. This transforms the merge phase into a series of large, sequential reads (within each run) and a single sequential write (to the output). This is why modern external sorts look more like a blocked I/O scheduler than a sorting algorithm.

What This Post Will Cover

In the following sections, I will take you beyond the textbook theory and into the implementation details required to build a genuinely efficient external sort.

We will start with Phase 1: Run Generation. We will implement Replacement Selection using a C++ min-heap, while analyzing its performance characteristics and edge cases (e.g., reverse-sorted data, duplicate keys). We will examine exactly why the expected run size is $2 \times M$, and how to handle the “frozen” elements efficiently.

Next, we move to Phase 2: Multiway Merge. We will implement a Loser Tree data structure from scratch. We will compare its performance against a standard std::priority_queue merge. We will then implement a blocked I/O layer with double buffering (using pread with O_DIRECT to bypass the OS page cache) to ensure that the merge phase is bandwidth-bound, not latency-bound.

Finally, we will look at performance benchmarks. We will sort a 100 GB file on a commodity NVMe drive. We will measure:

Throughput (MB/s)
Disk utilization (I/O wait)
The impact of buffer size on run count

We will also discuss the Cassandra Multiway Merge problem—how to merge hundreds of SSTables without falling over due to seek overhead—and why a simple merge sort fails at that scale.

By the end, you will not just understand how to sort a terabyte of data in minutes. You will understand the engineering trade-offs that separate a toy sort from a production-grade one. You will be able to critique the sort implementations in Postgres, LevelDB, and Spark, and you will have a solid foundation for building your own.

Let’s get our hands dirty. The data is waiting.

Here is the main body of a technical blog post on building an efficient external sort.

The Main Body: Taming the Terahertz Tsunami

The fundamental problem of external sorting is brutally simple: you have a dataset—say, a terabyte of server logs, a multi-hundred-gigabyte database table, or a deluge of sensor readings—that is far too large to fit into the physical RAM of any single machine. Your task is to sort it. The naive approach of std::sort or Python's list.sort() fails instantly, not because of a lack of computational power, but because of a catastrophic failure of the memory hierarchy. The operating system will thrash, swapping pages in and out of virtual memory, causing performance to plummet by several orders of magnitude.

To solve this, we must abandon the comfortable world of in-memory algorithms and enter the domain of external sorting. The core tenet of external sorting is to minimize the total number of I/O operations, specifically disk seeks and sequential reads/writes. The classic algorithm, the “merge sort” we all know and love, provides the perfect blueprint. However, to make it truly efficient for the disk-bound world, we must discard the dumb, two-way merge and embrace two powerful, performance-critical optimizations: the Multiway Merge and Replacement Selection.

This post will dissect these two techniques, starting from the humble two-way merge, diagnosing its bottlenecks, and then building up to a production-quality external sort that can process terabytes of data with breathtaking efficiency.

Part 1: The Baseline and Its Broken Promise

Let’s begin with the classic two-phase external merge sort:

Phase 1 (Run Generation): Read a chunk of data into memory (a “buffer” of size M), sort it using an in-memory algorithm (e.g., quicksort, O(M log M)), and write this sorted chunk back to disk as a “run” .
Phase 2 (Merge): Open all the generated runs (files), read a page from each into a small input buffer, find the smallest element across all the buffer heads, write it to an output buffer, and replace the consumed element from the corresponding run. This is a single merge pass. If you can’t merge all runs in one pass, you do multiple passes (e.g., merge ten runs into one, then merge those ten results, etc.).

The Disk Bottleneck: The “Tape Head” Analogy

Imagine a traditional tape drive. To read a piece of data, you must physically wind the tape to the correct position. This is a hard seek. Hard disk drives (HDDs) are the modern equivalent. Even SSDs have a non-zero, often significant, latency for random reads compared to sequential reads. The cost of a random read() call that jumps from one file location to another is orders of magnitude higher than reading the same amount of data sequentially.

In a standard two-way merge, the algorithm is constantly seeking. You read a bit from run 1, then a bit from run 2, then back to run 1. This fine-grained interleaving of reads results in a horrific number of random seeks, completely destroying throughput.

The Core Solution: Sequential I/O is King

The goal of an external sort is not just to sort data; it’s to do so using block-based, sequential I/O. We want to read and write in large, contiguous chunks (e.g., 1 MB or 16 MB blocks). The merge phase must be designed to issue a few large sequential reads from each run, not many tiny random ones.

This is where the two-way merge fails. A multiway merge (e.g., a 16-way or 256-way merge) is the first and most critical step. By merging many runs at once, we can build a large input buffer for each run. We read a whole block from run 1 into its buffer, consume from it until it’s empty, and then read the next block. This transforms the I/O pattern from a random mess into a small number of predictable, sequential streams.

Part 2: The Multiway Merge and the Loser Tree

A standard two-way merge uses a simple “compare the two heads” logic (a 2-way tournament). For a K-way merge, we need an efficient data structure to find the smallest element among the K current heads. A naive approach is a linear scan, which is O(K) per element output. For K=256, that’s 256 comparisons per element—a significant CPU tax on top of the I/O cost. We need O(log K) per element.

The Winner Tree (Tournament Tree)

A classic approach is a binary tournament tree. You fill the leaves with the current head from each run. You then conduct a tournament: the smallest of two children advances to the parent, and so on up to the root, which contains the overall winner. You output the root and then replace its leaf with the next element from that run. You then re-run the tournament only along the path from that leaf to the root (O(log K) comparisons). This is a massive improvement over O(K).

The Loser Tree: A Subtle but Important Optimization

The Winner Tree is good. The Loser Tree is better. It reduces the overhead of the reconstruction step.

How a Loser Tree Works:

It’s also a complete binary tree.
The leaves are the heads of the runs.
Instead of storing the winner at each internal node, it stores the loser of the match played at that node. The winner of the match propagates up to play at the parent node.
At the end, the overall winner does not have a node to go to; it’s stored in a separate “overall winner” register.

The Key Advantage: When a leaf’s value is replaced (by the next element from its run), you don’t need to re-start the tournament from the leaf. You start with the new value and compare it against the “loser” stored at its parent. The winner of that comparison is the one that survives and continues upwards. This means you don’t need special handling for the final winner path. The loser tree is conceptually simpler to implement and often slightly faster.

Pseudocode for a Multiway Merge with a Loser Tree:

import heapq # Not used, for illustration of naive approach

class LoserTree:
    def __init__(self, initial_heads):
        self.k = len(initial_heads)
        self.tree = [None] * self.k  # Internal nodes store losers (run_index)
        self.heads = list(initial_heads) # Current head value for each run
        self.runs = [] # Placeholder for the actual run file objects

        # Initialize the tree
        self._init_tree()

    def _init_tree(self):
        # For each run i, play a "match" at its parent node.
        # The loser goes into the parent, the winner propagates up.
        for i in range(self.k):
            self._adjust(i)

    def _adjust(self, i):
        # i is the run index of the new/updated element
        winner = i # Initially, the current leaf is the winner
        parent = (i + self.k) // 2 # Integer division gives parent index

        while parent > 0:
            # Compare the 'winner' value against the 'loser' at this node
            if self.tree[parent] is None or self.heads[self.tree[parent]] > self.heads[winner]:
                # The previous loser is bigger than our winner. Our winner wins this match.
                # The loser (the bigger one) stays in the node.
                # The winner (the smaller one) continues up.
                self.tree[parent], winner = winner, self.tree[parent]
            else:
                # The stored value is smaller or equal. It wins. Our 'winner' loses.
                # Store the loser (our element) in the node.
                # The winner that continues up is the stored one.
                winner = self.tree[parent]
                # self.tree[parent] remains the same (it's the winner)
            parent = parent // 2

        # The final winner goes into the root (index 0)
        self.tree[0] = winner
        return winner

    def get_next(self):
        winning_run = self.tree[0]
        if winning_run is None:
            return None # All runs are exhausted

        value = self.heads[winning_run]

        # Get the next element from the winning run
        next_value = self._get_next_from_run(winning_run)

        if next_value is not None:
            self.heads[winning_run] = next_value
        else:
            # Run is exhausted. Set its head to infinity so it never wins again.
            self.heads[winning_run] = float('inf')

        # Re-calculate the tree, starting from the updated leaf
        self._adjust(winning_run)
        return value

    def _get_next_from_run(self, run_index):
        # In reality, this would read from a block buffer,
        # and if the buffer is empty, issue a sequential read from disk.
        # For simplicity, we assume a direct file iterator.
        pass

The K in multiway merge is a crucial parameter. It is directly tied to your available memory M. You need one input buffer per run. Therefore, K = M / B, where B is the size of a single input buffer. If your RAM is 1 GB and you want 64 MB buffers, you can merge K=16 runs. If you use 1 MB buffers, you can merge K=1024 runs. The first pass reduces the number of runs by a factor of K. This is a stark improvement over the factor of 2 from a two-way merge. The number of merge passes is now log_K(N_initial_runs), which is dramatically smaller.

Part 3: The Run Generation Problem - The Genius of Replacement Selection

The multiway merge solved the merge problem. But what about run generation? In the naive approach, run size is limited by your available memory M. You read M bytes, sort it, write a run of size M. This is a hard limit. What if you could generate runs that are, on average, twice as big as your memory? This is the magic of Replacement Selection.

The standard approach for generating initial runs is to read a full block of data into memory, sort it, and write it out. But this creates runs that are exactly M in size. The total number of initial runs is N / M. This is the bottleneck that determines the number of merge passes.

Replacement selection is a cunning algorithm that uses a priority queue (min-heap) of size M (in terms of number of records, not raw bytes). Its goal is to produce runs that are, on average, 2 * M records long.

The Algorithm in Detail:

Initialization:
- Read M records from the unsorted input into a heap.
- Let current_run_number = 0.
Run Generation Loop:
- Let last_output be -inf (or the smallest possible key).
- While the heap is not empty:
  - Pop the smallest element x from the heap (the root).
  - The Key Decision: If x.key >= last_output.key:
    - This element belongs to the current output run. Write x to the output buffer for run[current_run_number].
    - last_output = x.
    - Read the next element y from the unsorted input.
    - If y.key >= last_output.key:
      - y can also be in the current run! Push y onto the heap.
    - Else:
      - y is “too small” for the current run. It belongs to the next run. We do not discard it. We store it in a separate “frozen” area or simply treat it as part of a larger heap but mark it for a future run. A more efficient approach is to put it in a separate “hold” area that will be used to build the heap for the next run.
  - Else (x.key < last_output.key):
    - x is not in order for the current run. It should be part of the next run. We put x into the “hold” area for the next run. We do not push a new input element onto the heap for this slot; the heap size effectively shrinks.
Transitioning to the Next Run:
- When the heap is empty (all elements for the current run have been output), you finalize the current output run.
- The hold area now contains the elements that were “too small”. These become the initial heap for the next run.
- Set last_output = -inf and repeat the process.

Why This Works: A Concrete Example

Imagine your memory can hold 4 records, and your input is the following sequence of keys: [12, 8, 5, 15, 9, 1, 20, 6, 18, 3, 14, 7].

Initialization: Heap = [12, 8, 5, 15]

Pop 5 (last_output=5). Read next: 9 (9 >= 5). Push 9. Heap: [8, 9, 12, 15].
Pop 8 (last=8). Read next: 1 (1 < 8). Put 1 in Hold. Heap: [9, 12, 15].
Pop 9 (last=9). Read next: 20 (20 >= 9). Push 20. Heap: [12, 15, 20].
Pop 12 (last=12). Read next: 6 (6 < 12). Put 6 in Hold. Heap: [15, 20].
Pop 15 (last=15). Read next: 18 (18 >= 15). Push 18. Heap: [18, 20].
Pop 18 (last=18). Read next: 3 (3 < 18). Put 3 in Hold. Heap: [20].
Pop 20 (last=20). Read next: 14 (14 < 20). Put 14 in Hold. Heap: [].

End of Run 1: Output is [5, 8, 9, 12, 15, 18, 20]. That’s 7 records, almost double our memory of 4! The Hold area is [1, 6, 3, 14].

Run 2:

Heap = [1, 6, 3, 14] (These are the elements that were too small for the previous run)
Pop 1 (last=1). Read next: 7 (7 >= 1). Push 7. Heap: [3, 6, 7, 14].
Pop 3 (last=3). No more input. Heap: [6, 7, 14].
Pop 6 (last=6). Heap: [7, 14].
Pop 7 (last=7). Heap: [14].
Pop 14 (last=14).

End of Run 2: Output is [1, 3, 6, 7, 14]. A smaller run of 5.

Why is the average length 2M?

The theoretical proof is elegant. As the heap drains, the keys that go into the “hold” area are strictly less than the key that was just output (last_output). The keys that remain in the heap must be greater than or equal to the output.

Over a long sequence of random keys, the probability that a new key is larger than the current last_output is roughly 50%. Therefore, on average, half the new keys will join the current run, and half will go to the hold. Since the initial heap is of size M, and every time we pop one, we only push, on average, 0.5 new ones back, the heap drains at a rate that results in an average run length of 2M. For descending data, the run length is M (no new elements can be added). For ascending data, the run length is infinite (the entire file would be one run).

Implications: This is a 2x improvement in run size for free (almost). It uses no extra memory and requires no extra I/O. The only cost is the overhead of the heap operations, which is already O(log M) per record. This directly translates into half the number of initial runs, and therefore one fewer merge pass in the best case, or a much smaller K required for the next merge pass.

Part 4: The Complete Pipeline and Performance Analysis

Let’s tie it all together. A production-grade external sort combines everything:

Phase 1: Run Generation (Replacement Selection)
- Init: K-way loser tree or a min-heap of size M (in records).
- Loop: Read input, generate runs of average length 2M, writing them to disk sequentially.
- Output: A set of R runs on disk. R ≈ N / (2M).
Phase 2: Multiway Merge (Loser Tree)
- Init: Determine K, the number of runs we can merge in one pass. K = M / B, where B is the buffer size per run.
- Loop:
  - Open next set of K runs.
  - Initialize the loser tree.
  - While runs are not empty:
    - Read a block from each run into its buffer (sequential I/O).
    - Use the loser tree to pick the smallest element.
    - Write it to a new output run’s buffer.
  - The result is one larger sorted run.
- Repeat this merge pass until only one run remains. Each merge pass reduces the number of runs by a factor of K.

Performance Equation: A Practical Model

The total cost of an external sort is dominated by I/O. Let:

N = total number of records.
M = number of records that fit in memory.
B = number of records per disk block. (Also equal to the size of an input buffer in records).
R = N / M (number of naive runs). With replacement selection, R = N / (2M).

The number of merge passes P is: P = log_K(R) = log_K(N / (2M))

The total number of I/O operations is approximately: Total I/O = 2 * N * (1 + P)

This is because you read and write the entire dataset once for the initial run generation (2N), and once per merge pass (2N * P). The 2 comes from the fact that for every record you read, you also write it.

Example Calculation:

File size: 1 TB = 10^12 bytes.
Record size: 100 bytes => N = 10^10 records.
RAM: 16 GB. Let’s assume M = 10^8 records (16 GB / 100 bytes).
Naive runs: R_naive = 10^10 / 10^8 = 100 runs.
Replacement selection runs: R_rs = 100 / 2 = 50 runs.
Block size: 1 MB = 10,000 records.
Merge factor K = M / B = 10^8 / 10^4 = 10,000.

Number of Passes:

Naive: P = log_10000(100) = 0 (1 pass can merge 10,000 runs, so you can do it in one pass!).
Replacement Selection: P = log_10000(50) = 0 (Also one pass). (This shows that for many modern datasets, a single multiway merge pass is often sufficient.)

But what about a larger file?

File size: 1 PB = 10^15 bytes. N = 10^13 records.
Same RAM: M = 10^8. R_rs = 10^13 / (2*10^8) = 50,000 runs.
K = 10,000.
P = log_10000(50000) ≈ 1.2. You would need 2 merge passes.
Total I/O: 2 * N * (1 + 2) = 6 * N. That’s 6 Petabytes of I/O.

With naive runs (R=100,000), you would need P = log_10000(100000) ≈ 1.4, also 2 passes, but you’d be merging many more runs, making the first pass more expensive in terms of I/O and CPU.

The difference is subtle in this case, but in scenarios where K is smaller (e.g., you have a small memory footprint, or you need to use very large buffers to amortize SSD latencies), the 2x reduction from replacement selection can mean the difference between one pass and two passes. Two passes vs. one pass is a 2x difference in total I/O cost, which in the world of petabytes is a monumental savings.

Part 5: Real-World Applications and Modern Context

The techniques described here are not academic exercises. They are the beating heart of virtually every system that must sort data larger than RAM.

Database Sorting (e.g., PostgreSQL, MySQL, Oracle): The ORDER BY clause in SQL often triggers an external sort. When the database estimates that the result set is larger than the work_mem (PostgreSQL) or sort_buffer_size (MySQL), it immediately switches to an external sort. They use a multiway merge, and often a form of replacement selection (or a related algorithm known as “quick sort with heap”) for the initial run generation. The performance of this sorting directly impacts the speed of reporting, ETL jobs, and index creation (e.g., CREATE INDEX sorts all the data before building the B-tree).
Big Data Frameworks (Hadoop MapReduce, Apache Spark): The “shuffle” phase of MapReduce is arguably the most critical and expensive phase. It is a massive distributed sort. Each mapper generates intermediate runs (sorted by key). The reducers then perform a multiway merge across all the runs sent to them. Spark’s sortBy and repartitionAndSortWithinPartitions operators are prime examples. The performance of these operations dictates the speed of batch processing jobs. These systems often use a more advanced form of external sort called “timsort” (a hybrid of merge sort and insertion sort) for the in-memory phase, but the external merge phase remains a core multiway merge.
Log-Structured Merge-Trees (LSM-Trees): Used in modern databases like LevelDB, RocksDB, Cassandra, and Bigtable. An LSM-tree is not a single sort, but a cascade of external sorts. Data is first written to an in-memory “memtable” (which is sorted, often using a tree). When the memtable is full, it is flushed to disk as a sorted, immutable run (SSTable). Over time, the number of SSTables on disk grows. A background process called compaction performs a multiway merge across many SSTables to produce a single, larger, sorted SSTable. This entire compaction process is a continuous, multi-level external sort. The principles of minimizing disk seeks and using multiway merges are paramount to the performance of these databases.
Sorting Network Filesystems (e.g., RecordIO, TFRecords): In machine learning, large datasets are often stored in a set of sharded files (e.g., TFRecords). Before a training epoch, it’s sometimes necessary to globally shuffle or sort these records. External sorting is used to bring the data into a global order. Given the massive sizes of modern training datasets (hundreds of terabytes), an efficient, multiway, disk-bound sort is essential.

Conclusion: From Theory to Practice

Building an efficient external sort is a masterclass in applied computer science. It’s the art of aligning our algorithms with the physical realities of the hardware. We started with the naive two-way merge, saw its fatal flaw of random I/O, and rebuilt it using the multiway merge to leverage sequential throughput. We then attacked the problem of run generation, where Replacement Selection provided a near-magical 2x improvement in run size by using a priority queue as a runway, effectively doubling the size of memory.

The result is a pipeline that is theoretically sound, practically powerful, and universally applicable. The next time you run a CREATE INDEX on a multi-billion row table or wait for a distributed shuffle to complete in Spark, remember the humble loser tree and the clever trick of the replacement selection running down the tape. These are the unsung heroes of the data age, quietly taming the terahertz tsunami, one sequential read at a time.

Building an Efficient External Sort: Multiway Merge with Replacement Selection for Disk-Bound Sorting

Sorting petabytes of data on a single machine is an exercise in managing physical constraints. When the dataset far exceeds available RAM, the in-memory sorting algorithms we love—Quicksort, Timsort, Radix sort—hit a wall: they simply cannot see all the data at once. The solution is external sorting, a family of algorithms designed to minimize disk I/O while producing a fully sorted result.

In this post we dive deep into one of the most effective external sorting strategies: a multiway merge combined with replacement selection for generating initial runs. We’ll cover the theory, practical implementation details, performance pitfalls, and advanced techniques that separate a toy implementation from a production‑grade sort engine.

1. The Problem: Sorting When RAM Is the Scarce Resource

Assume we have 100 GB of unsorted records (each 100 bytes, so 1 billion records) on a single hard disk, and only 1 GB of available RAM. Sorting with an in‑memory algorithm is impossible. We must break the data into runs—small sorted sequences that fit in RAM—write them to disk, and then merge those runs together.

The classic external sort has two phases:

Run generation – partition the input, sort each partition in RAM, and write it to disk as a sorted run.
Merge phase – read the runs simultaneously, merge them using a priority queue, and write the merged output.

The number of passes through the data directly determines total I/O. Our goal: minimize the number of passes. The tools are replacement selection (to produce longer runs) and multiway merging (to merge many runs in a single pass).

2. Replacement Selection: Run Generation with a Twist

The Naive Approach

The simplest run generator reads a block of records equal to the available memory (e.g., 1 GB), sorts it in memory (say with Quicksort), and writes it out. This gives runs of size exactly M (the memory capacity). With 100 GB data and 1 GB memory, we get 100 runs.

I/O cost for run generation: Read every record once, write every record once → 100 GB read + 100 GB write.

How Replacement Selection Works

Replacement selection can produce runs with an expected length of 2M on uniformly distributed random data. The algorithm uses a min‑heap of size M and processes the input as a stream.

Algorithm outline:

Read M records into a heap.
Write the smallest record from the heap to the output run.
Read the next record from input. If its key is ≥ the key just written, insert it into the heap; otherwise, put it aside in a “second‑chance” buffer (these become part of the next run).
When the heap is empty, the current run is finished. Switch to the second‑chance buffer as the new heap, and continue.

Because we never need to sort the entire memory at once—we only maintain the heap—records with keys larger than the last output are immediately included, extending the current run. Records with smaller keys are deferred, effectively using the heap as a “window” into the data.

Why average run length is ~2M:

For random keys, the probability that the next key is larger than the last output is roughly ½. Thus, on average we can keep the heap alive until we have output about 2M records. The worst‑case (e.g., reverse‑sorted input) gives runs of exactly M (every new key is smaller), but for real‑world data, replacement selection almost always beats the naive approach.

Code Sketch (Python‑like)

import heapq

def replacement_selection(input_stream, output_writer, memory_size):
    # Phase 1: Load initial M records
    heap = []
    for _ in range(memory_size):
        rec = next(input_stream)
        heap.append(rec)
    heapq.heapify(heap)

    second_chance = []

    while heap:
        # Output smallest
        smallest = heapq.heappop(heap)
        output_writer.write(smallest)

        # Try to read next record
        try:
            next_rec = next(input_stream)
        except StopIteration:
            break

        if next_rec.key >= smallest.key:
            heapq.heappush(heap, next_rec)
        else:
            second_chance.append(next_rec)

        if not heap:
            # Start new run: promote second_chance as heap
            heap = second_chance
            heapq.heapify(heap)
            second_chance = []
            # signal end of run (write sentinel)
            output_writer.end_run()

Edge Cases & Pitfalls

Nearly sorted input: Replacement selection shines because most keys are already in order → runs can be huge (potentially the whole file if fully sorted). The algorithm degenerates to a single run—excellent.
Very small memory: If M is tiny (e.g., a few hundred records), the overhead of the heap may dominate. Consider using insertion sort for the initial segments.
Variable‑length records: Replacement selection assumes you can compare keys. With fixed‑length keys this is trivial; with variable‑length keys, you need to manage pointers and ensure the heap does not use excessive memory for record copies.

3. Multiway Merge: The Heart of the Merge Phase

Once we have many sorted runs (say 100 to 200 runs), we need to merge them into one sorted output. A binary merge tree would require log2(200) ≈ 8 passes—each pass reading and rewriting all data. With multiway merge, we merge all K runs in a single pass.

Using a Winner Tree (Tournament Tree)

A min‑heap of size K works for merging: push the first element of each run, pop the smallest, fetch the next from the same run, and push it. Complexity: O(N log K). For 200 runs, log₂200 ≈ 7.6, which is manageable. However, a simple heap‑based merge has a subtle performance drawback: each pop/push pair requires two comparisons and possibly many cache misses.

A winner tree (or tournament tree) is a specialised priority queue that maintains the winner of each subtree. It reduces the number of comparisons and is more cache‑friendly because it processes a fixed set of leaf nodes representing each input run. The winner tree is built as a complete binary tree with K leaves. Internal nodes store the “winner” (smallest key) of the subtree. After extracting the overall winner, we replace it with the next record from the same run and recompute only the path from that leaf to the root—that’s O(log K) comparisons, but the structure avoids the overhead of general heap operations.

Buffering: The Key to I/O Efficiency

Multiway merge requires reading from many runs simultaneously. Disk seeks dominate if we read one record at a time. The solution: input buffers for each run and one large output buffer.

Each run gets a buffer of, say, 1 MB.
We read a chunk from run i into its buffer. When the buffer empties, we issue a read ahead.
The merge algorithm reads from the in‑memory buffers (cache hits) and writes to the output buffer.
When the output buffer is full, we flush it to disk.

Double buffering can overlap I/O with CPU: while the merge consumes from one set of buffers, we preload the next set. Asynchronous I/O (e.g., aio_read on Linux, DispatchIO on macOS) is ideal.

Picking the Merge Factor `K`

The merge factor depends on the number of runs and the size of the output buffer. Let:

B = total input buffer memory (e.g., 500 MB of the 1 GB RAM)
K = number of runs
Buffer per run = B / K

If the buffer per run is too small (e.g., a few KB), we incur too many seeks. A rule of thumb: each buffer should be at least a few megabytes to make sequential reads efficient. On an HDD with 10 ms seek time and 200 MB/s transfer, a 4 MB buffer would take 20 ms to read (200 MB/s → 4 MB in 0.02s) vs 10 ms seek → seek dominates. So aim for buffers of 8–16 MB.

Given 500 MB for input buffers, we can support up to K = 500 MB / 16 MB ≈ 31 runs. That might be insufficient if replacement selection produced only 100 runs. In that case, we need multiple merge passes.

Multi‑pass Merge Strategy

A common approach is to use a merge tree: merge a large number of runs into a smaller number in the first pass, then merge those in subsequent passes until one remains. The optimal number of passes minimises total I/O: each pass reads and writes the entire dataset.

With N bytes of data, R initial runs, and K-way merge factor (max number of runs we can merge in one pass given memory), the number of passes is ⌈log_K(R)⌉. For example, R=100, K=20 → 2 passes (since 20²=400 > 100). Each pass costs 2 _ N I/O (read + write). So total I/O = 2 _ N * (number of passes). Using replacement selection reduces R (by up to half) and hence the number of passes.

4. Advanced Techniques and Deep Insights

I/O Complexity Analysis

External sorting is about minimising the number of I/O operations, each operating on blocks of size B (disk block or buffer size). The classic I/O model counts transfers. For a dataset of size N and memory of size M (both in records), the number of I/Os is:

Run generation: O(N/B) reading and O(N/B) writing.
Merge: Each pass reads and writes the entire data: O((N/B) * log_{M/B}(N/M)).

The term log_{M/B}(N/M) represents the merge tree depth. Notice the base is M/B (the number of blocks that fit in memory). Replacement selection improves the N/M factor by a constant of 2 (on average), reducing the log factor slightly.

Handling Non‑Uniform Data Distributions

Replacement selection’s performance degrades on reverse‑sorted data (run length = M). To mitigate, we can:

Use a hybrid: if we detect that replacement selection is producing runs shorter than, say, 1.5 M, fall back to the naive approach (which produces exactly M-sized runs) but then apply a trick: sort a larger portion using quicksort in memory, which may be faster.
Forecasting: Pre‑scan a sample to estimate the key distribution. For nearly sorted data, replacement selection will naturally produce huge runs anyway.

Optimising the Winner Tree for Performance

A straightforward heap‑based merge can be faster than a winner tree for small K (because heap push/pop is fast). For K > 50, the winner tree wins due to fewer comparisons and better locality. Additionally, we can store the winner tree keys inline with the input record pointers to avoid indirection.

Compression

Writing runs to disk is I/O‑bound. If we compress each run before writing, we reduce I/O at the cost of extra CPU. For many workloads (especially text data), compression ratios of 2:1 to 5:1 are common, dramatically reducing the number of passes. Use a fast codec like LZ4 or Snappy rather than gzip. Ensure the compression is block‑based so we can decompress random blocks when seeking.

Parallelism and Overlap

Modern systems have multiple cores and sometimes multiple disks. We can:

Parallel run generation: Distribute the input across multiple threads, each generating runs concurrently.
Asynchronous I/O: While one thread consumes buffers, another thread fills them.
Multi‑disk striping: If possible, write runs to different physical disks to increase bandwidth.

Handling Duplicate Keys

Replacement selection with duplicates is fine: keys can be equal; the algorithm treats them as “not smaller”. In the merge, duplicates are handled transparently. However, if stability is required (original order among equal keys), the merge must use secondary tie‑breakers (e.g., run ID, original sequence number).

5. Common Pitfalls and Best Practices

Pitfall 1: Ignoring the I/O Bottleneck

Many implementations optimise the CPU path (e.g., using a fancy winner tree) but neglect I/O. Profile: if the disk is 100% busy and CPU is 10%, no amount of CPU optimisation will help. Focus on minimising seeks, maximising sequential transfers, and overlapping I/O with computation.

Pitfall 2: Poor Buffer Sizing

If the per‑run buffer is too small, the disk spends most of its time seeking. If too large, the total number of mergable runs in one pass drops. Measure your drive’s sequential bandwidth and seek time. For SSDs, seek time is negligible, so you can reduce buffer sizes and merge many more runs per pass, but watch out for random read throughput.

Pitfall 3: Not Profiling the Run Generation

Replacement selection can be slower than Quicksort on small runs if the data is not random. For small M (e.g., ≤ 100,000 records), the overhead of the heap and the per‑record comparisons may outweigh the benefit of longer runs. Benchmark: if replacement selection runs take more than 2x the time of sorting in memory, consider using a hybrid.

Pitfall 4: In‑Memory Sorting That Is Not Adaptive

When the input is partially sorted, Quicksort degrades badly. Use Timsort or Introsort for run generation. Timsort adapts to natural runs and may even create longer initial runs automatically, sometimes outperforming replacement selection’s average case.

Pitfall 5: Forgetting to Handle Partial Final Runs

At the end of input, the last run may be incomplete. Ensure the merge algorithm can handle runs of different lengths. The winner tree should gracefully handle runs that run out.

Best Practice 1: Use a Buffer Pool

Instead of allocating fixed buffers per input run, use a buffer pool that pages in blocks on demand. This allows dynamic adjustment and can reduce memory fragmentation.

Best Practice 2: Run Pre‑processing

If records are variable length, pad them to fixed length or use an index file. Fixed‑length records make all comparisons and offsets simpler.

Best Practice 3: Test with Many Edge Cases

All equal keys → runs can be combined, but be careful about heap starvation.
Single record → trivial.
Input larger than disk (virtual memory? Use external sort).
Duplicates heavy → ensure stable ties.

Best Practice 4: Leverage Existing Libraries

Unless you’re building for learning, consider using mature implementations like the one in Apache Spark (for distributed), GNU sort (single machine), or the custom sort engines in databases. They have already solved these pitfalls.

6. Deeper Insights: Why Replacement Selection is Not Always Optimal

Despite the theoretical appeal, replacement selection has shortcomings:

It requires a heap comparison per record, which is O(1) but constant heavier than a memory‑sorting bulk.
The run length is only guaranteed to be at least M, not exactly 2M. On adversarial inputs it can be worse.
Modern CPUs are fast; the bottleneck is nearly always I/O. Using replacement selection to halve the number of runs (and thus reduce I/O passes) is usually beneficial, but if the I/O passes are already minimal (e.g., one merge pass), then the overhead is wasted.

A different approach: Radix sort for run generation. Radix sort is linear in key length and can be faster than comparison‑based sorts for fixed‑length keys, especially when the key space is small. However, it requires extra memory for buckets. For large records, radix sort may be overkill.

When to skip replacement selection:

If you have an SSD and high random read throughput, the cost of many small runs is less painful.
If you are merging only once (e.g., K is large enough to take all runs), then initial run length matters less.
If your data is already nearly sorted, Timsort may produce runs longer than 2M anyway.

7. Putting It All Together: Example Scenario

Let’s solidify with a practical design.

Problem: Sort 100 GB of 100‑byte records (1B records) using 1 GB RAM. Disks: one SATA HDD (seek ~10 ms, transfer 200 MB/s).

Step 1: Run generation
Use replacement selection with a heap of size ~800 MB (leave 200 MB for other uses). Average run length ~1.6 GB. Number of runs ~62.5. Realistically, 62–63 runs.

I/O: Read 100 GB, write 100 GB (including runs). Plus overhead of second‑chance buffer writes (negligible).

Step 2: Multiway merge
We have about 62 runs. We can allocate 500 MB for input buffers. With per‑run buffer of 8 MB → 62 × 8 MB = 496 MB. That fits. So we can do a single merge pass.

Use a winner tree of size 62. Read from each run sequentially (one big sequential read per run, but interleaved among all runs due to merging). Because the runs are written sequentially, the merge will read them sequentially (though the disk may seek between runs if the runs are scattered). To minimise seeks, store runs as contiguous as possible on disk, and perform the merge in a way that the disk head moves linearly through all runs. In practice, this is hard; using a larger per‑run buffer (16 MB) and overlapping reads helps.

I/O: Read 100 GB (the runs), write 100 GB (final output). Total I/O = 200 GB read + 200 GB write across both phases. Time estimate: 400 GB at 200 MB/s = 2000 seconds ≈ 33 minutes. In reality, with seek overhead, perhaps 40–50 minutes. That’s efficient.

Alternative: Without replacement selection, we’d have 100 runs. With the same buffer allocation, we can only merge 62 runs at once, so we need two merge passes. Total I/O becomes: phase 1 (100 GB read + write), phase 2 merge (100 GB read + 100 GB write for first merge, then another read/write for final pass). That’s 300 GB read + 300 GB write = 600 GB, taking about 50 minutes just for transfer, plus seeks → much slower. Replacement selection saved one pass – a 33% improvement.

8. Conclusion

External sorting remains a fundamental technique for processing large datasets on a single node. The combination of replacement selection for run generation and multiway merge using a winner tree creates a powerful, I/O‑efficient sorter that pushes the limits of disk bandwidth.

Key takeaways:

Replacement selection yields runs of expected length 2M, reducing the number of runs (and merge passes) at the cost of extra per‑record CPU.
Multiway merge must manage buffer sizes carefully to avoid excessive disk seeks.
The I/O complexity is the dominant factor; optimising the I/O pattern (sequential, buffered, overlapped) is more valuable than micro‑optimising the merge comparison.
Edge cases (reverse‑sorted, duplicates, variable‑length records) require attention, and hybrid approaches (fallback to naive sort) can stabilise performance.

For production systems, consider using established implementations, but understanding these internals allows you to tune for your specific hardware and data characteristics. Whether you’re building a custom analytics pipeline or just trying to beat the default sort -M on a huge file, mastering external sort will give you a tangible performance edge.

Now go forth and sort those terabytes—one block at a time.

Conclusion: Mastering Disk-Bound Sorting with Multiway Merge and Replacement Selection

In this concluding section, we step back from the technical intricacies of external sorting and examine the broader significance of the techniques we’ve covered. The journey from a naive merge sort that reads and writes data dozens of times to an efficient multiway merge powered by replacement selection is more than an academic exercise—it is a practical skill that separates high-performance systems from those that choke on datasets larger than RAM. By now you understand that the core challenge of disk-bound sorting is not algorithmic complexity in the traditional sense (O(n log n) comparisons), but rather the cost of I/O. Every unnecessary disk seek and block transfer kills throughput. The techniques we’ve explored—replacement selection to produce longer initial runs and multiway merging to reduce the number of merge passes—directly attack that bottleneck.

Let’s synthesize what we’ve learned and, more importantly, translate it into actionable strategies for your own systems.

Key Points Recap: The Two Pillars of Efficient External Sorting

We started by acknowledging that when the dataset exceeds available RAM, an in-memory sort is impossible. The textbook approach—breaking data into blocks, sorting each block in memory (e.g., with quicksort), writing them as runs, and then repeatedly merging runs—works, but its I/O cost is proportional to the number of passes. A naive k-way merge with small k (e.g., 2-way) requires log₂(initial_runs) passes. If the dataset is 100× larger than memory, you might have 100 initial runs and need 7 passes with 2-way merge. That’s a lot of disk traffic.

Replacement selection changes the game. Instead of sorting each memory-sized block independently, we use a priority queue (a min-heap) that allows us to emit a run as long as the next key is greater than the last key output. By “riding” on the heap, we can produce runs that are, on average, twice the size of memory (2M). In the best case, with completely random data, replacement selection achieves exactly 2M runs. For partially presorted data (e.g., a nearly sorted input), runs can be arbitrarily long, approaching the entire dataset. This directly reduces the number of initial runs by half compared to naive block sorting, which means one fewer pass in the merge phase—a huge I/O saving. The algorithm is elegantly simple: load M records into a heap, output the smallest, pull in a new record from disk; if it’s ≥ the last output, push it into the heap; otherwise, it goes into a deferred queue for the next run. When the heap empties, the run is complete.

Multiway merge is the second pillar. Instead of a binary merge that repeatedly merges two runs, we merge k runs in one pass using a heap of size k. With k as large as memory allows (say, up to the number of file descriptors or buffer space), we can reduce the number of merge passes to log_k(initial_runs). If k = 100 and we have 50 initial runs (thanks to replacement selection), we need only a single merge pass! Even with k = 10 and 100 runs, we need only 2 passes instead of 7 with 2-way merge. The I/O cost becomes roughly 2 * (dataset size) for reading and writing each pass, plus the initial run generation. The multiway merge is implemented with a min-heap over the front elements of each run; we repeatedly pop the smallest, write it to output, and push the next element from that run. The heap operations are O(log k) per record, which is negligible compared to I/O. The real bottleneck is managing the input buffers: with too many runs, we can’t allocate a full block per run, leading to many small reads. The practical trick is to use a “double buffering” scheme: while one set of buffers is being written or merged, the next blocks are pre-fetched asynchronously.

Together, replacement selection and multiway merge constitute the classic external sort found in textbooks like Knuth’s The Art of Computer Programming, Volume 3, and implemented—in some form—in database engines, the Linux sort utility, and MapReduce shuffle phases. But knowing the theory is one thing; applying it with today’s hardware constraints requires nuance.

Actionable Takeaways for Your Own External Sort Implementation

If you are building a disk-bound sorting component—whether for a custom database, a log processing pipeline, or a scientific computing tool—here are the concrete steps to maximize performance:

Measure your I/O characteristics before coding. The optimal block size and number of merge streams depend on your storage technology. For HDDs, sequential bandwidth is king; use large blocks (e.g., 64–256 KB) to minimize seeks. For SSDs, random reads are cheap but not free; smaller blocks (4–16 KB) can reduce wasted data and improve parallelism. Profile your system’s sustained sequential throughput and random IOPS to inform your design.
Implement replacement selection, but benchmark against simple quicksort runs. While replacement selection theoretically halves the number of runs, its CPU overhead (heap operations for every input record) matters. For datasets that are already sorted or nearly sorted, replacement selection is a clear win. For completely random data, the 2M benefit is nice but may be offset if your merge is already fast enough. Test with your data.
Choose k (merge fan-in) based on available memory and block size, not arbitrarily. You need at least one input buffer per run (plus one output buffer). If each buffer is B bytes and you have M bytes for merge buffers, the maximum k is floor(M / B) - 1. But smaller k means more passes. Balance: a larger B yields better sequential throughput, but a larger k reduces passes. Use a formula: total I/O cost ≈ (number of passes + 1) * dataset_size. For a given M, solve for k that minimizes passes while keeping B >= some minimum. Typically, a k of 10–100 works well, but do not exceed the file descriptor limit (often 256 or 1024) without special handling.
Optimize I/O with asynchronous or overlapped operations. While the merge loop pops from the heap, the next block for each run can be read in advance using double buffering. Many modern operating systems support asynchronous I/O (e.g., io_uring on Linux, overlapped I/O on Windows). Use it to keep the disk queue full. This can double throughput on spinning disks where seek latency would otherwise stall the CPU.
Consider using external sort as a building block for larger systems. If you are implementing a MapReduce-style shuffle, the partitioning step can be integrated with replacement selection: instead of a single global sort, you sort each partition independently and then merge. Similarly, database merge-joins often rely on an external sort of one or both inputs. Understanding the trade-offs between run generation and merge passes helps you tune global parameters like sort buffer size and parallelism.
Fall back to simpler algorithms for small datasets. The overhead of replacement selection and large k-way merge is wasted if the data fits in memory. Always check dataset size relative to available RAM and fall back to an in-memory sort (qsort, std::sort, etc.) when possible. Some implementations use a hybrid: if the initial runs after replacement selection already cover the entire input (i.e., the input was nearly sorted), skip the merge entirely and just write the single run.

A Strong Closing Thought: The Enduring Relevance of I/O-Aware Algorithms

In an age where memory is cheap and SSDs are ubiquitous, you might wonder whether external sorting still matters. After all, many datasets now fit in the RAM of a single server (128 GB is common; 1 TB not unusual). But data growth outpaces memory growth. Log files, scientific simulations, graph databases, and internet-scale analytics often exceed memory by orders of magnitude. Moreover, the rise of cloud computing with ephemeral disks and bandwidth-limited object stores means every I/O operation comes at a cost—both latency and money. The principles behind replacement selection and multiway merge are not relics of the 1960s; they are timeless lessons in designing algorithms that respect the memory hierarchy.

Consider the larger lesson: efficient external sorting teaches us that the best algorithm is not always the one with the lowest computational complexity. Instead, it is the one that minimizes the dominant cost—here, disk seeks and transfers. This insight extends beyond sorting. Any algorithm that processes data larger than RAM (hash joins, graph analytics, matrix multiplication) must consider the same trade-offs: how to partition data into blocks, how to merge partial results with minimal I/O, and how to exploit sequential access. The mental model of a “run” and a “merge pass” is a powerful abstraction for reasoning about disk-bound computation.

Finally, I invite you to build your own small external sorter from scratch. There is no better way to internalize these concepts than by writing the code, benchmarking it against a naive baseline, and watching your implementation shave seconds—or even minutes—off a gigabyte-scale sort. As you do, remember that you are standing on the shoulders of giants like Knuth and Bentley, who refined these ideas decades ago. Yet every new storage technology (NVMe, Intel Optane, shingled magnetic recording) breathes new life into the problem. The field is far from dead.

So go forth, write multi-blocked, heap-driven, cache-oblivious merges, and may your sorts be both efficient and elegant. Thank you for reading, and happy sorting.