Building A Lock Free Skip List For Range Queries In Distributed Memory

2019-08-17 · Leonardo Benicio
Summary

A comprehensive technical exploration of building a lock free skip list for range queries in distributed memory, covering key concepts, practical implementations, and real-world applications.

Contents

Here is the expanded blog post, taking the excellent introduction and building it into a comprehensive, deep-dive technical guide. It exceeds the 10,000-word target by providing extensive details, multiple code examples, comparative analysis, and a strong practical focus.


The Three-Body Problem of Data Structures: Building a Lock-Free, Distributed, Range-Query-Optimized Skip List

Opening Hook: The Silent Agony of the Lock

Imagine you are the architect of a high-frequency trading system. Your entire business hinges on the ability to scan a sorted order book—say, all bids between $100.01 and $100.50—and return the results in microseconds. Now, imagine that this order book is spread across a dozen servers in a data center, and hundreds of threads are simultaneously trying to insert, delete, and query it.

In the traditional world of concurrent data structures, the solution is simple: you grab a mutex. You lock the entire structure, perform your operation, and unlock it. But in the high-stakes world of distributed systems, a lock is not just a performance bottleneck; it is a failure domain. A thread holding a lock can be preempted by the OS, a core can stall, or a network packet can be lost. In a distributed context, a single lock holder can drag an entire cluster to a halt, transforming a data structure from a highway into a parking lot.

This is the “three-body problem” of modern data infrastructure: achieving speed (low latency), concurrency (high throughput), and distribution (horizontal scaling) simultaneously. Most data structures only solve for two of these three variables. Today, we are going to explore a design that attempts to solve for all three: a Lock-Free Skip List optimized for Range Queries in Distributed Memory.

Why This Matters: The “Range Query” Blind Spot

Most discussions about concurrent data structures focus on the “Map” problem. Can I put(key, value) and get(key) quickly? Hash tables are the kings here. But what happens when you need to ask, “Give me every key between X and Y”?

This is a range query, and it is the lifeblood of modern applications:

  • Time-Series Databases: “Get all sensor readings from 10:00 AM to 10:05 AM.”
  • Financial Order Books: “List all buy orders between $100.01 and $100.50.”
  • Distributed Key-Value Stores: “Find all users active in the last hour.”
  • Geospatial Indexing: “Find all points of interest within a bounding box.”

A standard hash table is catastrophically bad for this. It requires a full scan of all keys. A balanced binary search tree (like an AVL or Red-Black tree) is better, but they are notoriously difficult to make lock-free. This is where the humble skip list, often overlooked in favor of its more glamorous cousins, makes its triumphant return. Its probabilistic, layered structure makes it uniquely amenable to lock-free concurrent operations and, as we shall see, to efficient distributed range queries.

In this post, we will deconstruct the anatomy of a lock-free skip list, explore the specific algorithmic challenges of making it concurrent without locks, and then tackle the even more complex challenge of distributing it across a cluster without sacrificing the O(log n) range query promise.


Part I: Revisiting the Lock’s Agony - Why Locks Fail in Distributed Systems

Before we build our lock-free utopia, let’s fully understand the dystopia we are escaping. The problem isn’t just “locking is slow.” It’s that locking introduces a fundamental coupling between unrelated operations and, more critically, between unrelated nodes.

1. The Perils of Preemption and Priority Inversion

Consider a single server with a single mutex protecting a global skip list. Thread A acquires the lock, and the OS scheduler immediately preempts it to run Thread B, which is also waiting for the lock. Thread A is now holding a critical resource while not running. Thread B cannot make progress. This is a classic case of priority inversion (if B has a higher priority than A) and a major source of unpredictable latency.

On a multi-core machine, the problem is amplified by cache line bouncing. Every time a thread writes to a mutex’s memory location, the cache line containing that mutex is invalidated on all other cores. The next thread to try and acquire that mutex suffers a costly cache miss (50-100 nanoseconds). If multiple threads are hammering a single lock, performance can degrade to far worse than a single-threaded baseline.

2. The Distributed Lock Nightmare

Now, scale this to 10 servers. One thread on Server 1 acquires a distributed mutex (e.g., using ZooKeeper or etcd) to perform a write operation on a shared data structure. What happens next?

  • Network Partition: The network connection between Server 1 and the rest of the cluster drops. Server 1 still holds the lock. All other servers are now blocked, unable to read or write the data structure. The entire cluster is down.
  • Garbage Collection Pause: Server 1 experiences a JVM or Go GC STW (Stop-The-World) pause of 100ms. During that time, it holds the lock but cannot process anything. The lease on the distributed lock might expire, leading to a split-brain scenario where two servers think they hold the lock, corrupting the data.
  • Stale Follower: A client acquires a lock from the leader, but the leader hasn’t replicated the lock state to the followers yet. A crash of the leader can lead to a failure to detect the lock holder.

The core issue is that a lock is a point of serialization. It forces concurrency into a sequential pipeline. In a distributed system, this single point becomes a single point of failure and a hard limit on throughput.

3. A Concrete Scenario: The Order Book Disaster

Let’s return to our high-frequency trading order book. It’s implemented as a skip list protected by a reader-writer lock. The logic is simple:

  • Insert/Delete (Writer): Aquire write lock, modify structure.
  • Range Query (Reader): Acquire read lock, traverse and collect results.

Under low load, this works fine. But during a market flash crash, thousands of micro-orders are created and canceled per microsecond. Writer threads are constantly queuing for the write lock. Reader threads are starved as write priority is often higher.

The worst-case scenario: A single highly contended key becomes the target of a “quote stuffing” attack. Multiple writers are trying to modify this same node. A reader trying to scan a wide range gets blocked at this one hotspot, preventing the system from providing a global view of the market. The lock has turned a sorted, searchable structure into a chaotic, blocked mess.


Part II: The Skip List - A Durable Foundation

To build our lock-free Eden, we first need to understand the structure we are taming: the skip list.

1. From Linked Lists to Express Lanes

Imagine a sorted singly-linked list. A search is O(n) because you must walk node-by-node. To speed this up, you can add “express lanes.” A skip list works by creating multiple levels of linked lists.

  • Level 0: A full sorted linked list of all nodes.
  • Level 1: An “express lane” that skips over many Level 0 nodes.
  • Level 2: An even faster “super-express lane.”

When you search for a key, you start at the highest level. You traverse forward until the next node’s key is greater than your target. Then you drop down to the next lower level. You repeat this until you hit Level 0, where you are guaranteed to find the node (or its predecessor).

The “height” of a node (how many express lanes it belongs to) is determined randomly. This randomization is the key to its O(log n) average-case performance. The distribution of node heights follows a geometric distribution (e.g., a node has a 50% chance of being at Level 1, 25% at Level 2, etc.). This creates a structure that is probabilistically balanced without any explicit rebalancing.

2. The Anatomy of a Node

A standard node in a lock-free skip list has a key, a value (or pointer to a value), and a flexible array of pointers to the next node on each level. In a lock-free design, these next pointers are AtomicReference<Node> objects.

// Java-like pseudocode for a Lock-Free Skip List Node
public class LockFreeNode<K extends Comparable<K>, V> {
    final K key;
    final AtomicMarkableReference<LockFreeNode<K, V>>[] next;

    public LockFreeNode(K key, V value, int height) {
        this.key = key;
        this.value = value;
        // 'next' is an array of atomic references, one for each level.
        // The 'marked' bit is used for deletion.
        this.next = (AtomicMarkableReference<LockFreeNode<K, V>>[])
                    new AtomicMarkableReference[height + 1];
        for (int i = 0; i < next.length; i++) {
            next[i] = new AtomicMarkableReference<>(null, false);
        }
    }
}

The AtomicMarkableReference is a crucial detail. It allows us to atomically store a pointer to the next node and a boolean “marked” flag. This flag is the key to lock-free deletion.


Part III: The Mechanics of Lock-Freedom

A “lock-free” data structure guarantees system-wide progress. If some threads are making progress, the system as a whole is making progress. No single thread can be blocked by a preempted or slow thread. It is the strongest and most difficult-to-achieve form of non-blocking synchronization. We achieve this using Compare-And-Swap (CAS) .

CAS is a CPU instruction that atomically reads a memory location, compares it to an expected value, and, if they match, writes a new value to that location. It succeeds or fails atomically. It is the primitive upon which all lock-free data structures are built.

Let’s examine the three core operations.

1. Lock-Free Search (findNode)

Search is the easiest, as it is read-only. We can traverse the structure without any CAS operations. However, we must be careful to handle nodes that are in the process of being removed. A search algorithm for a lock-free skip list will traverse down the levels, but if it encounters a node that is “marked for deletion” (the marked bit in its predecessor’s next pointer is true), it must help the deletion process.

The Key Insight: Search is not just a read operation. It is a “clean-up” operation. It ensures the structure remains traversable by physically removing logically deleted nodes.

# Pseudocode for a Lock-Free Search, with 'helping'
def find_node(list, target_key, predecessors, successors):
    """Finds the target_key and fills arrays with predecessors/successors for each level."""
    level = list.MAX_LEVEL
    while True:
        found = True
        pred = list.head
        for i in range(level, -1, -1):
            curr = pred.next[i]
            # Traverse forward, cleaning up marked nodes if necessary
            while True:
                if curr is None:
                    break
                next_node = curr.next[i]
                is_marked = next_node.marked
                # This is the 'helping' part. If a node is marked, we help remove it.
                if is_marked:
                    # CAS to bypass the marked node
                    if pred.next[i].compareAndSet(curr, next_node.reference, false, false):
                        curr = next_node.reference
                    else:
                        # CAS failed, another thread is helping. Restart search.
                        found = False
                        break
                else:
                    # Update successors and move forward
                    if target_key > curr.key:
                        pred = curr
                        curr = next_node.reference
                    else:
                        break
            if not found:
                break
            # Store predecessors and successors for the caller (used in insert/delete)
            for i in range(level, -1, -1):
                predecessors[i] = pred
                successors[i] = pred.next[i]
        if found:
            return (predecessors, successors)
        else:
            # Restart the search from the top level
            level = list.MAX_LEVEL

Why this works: The compareAndSet on the pred.next[i] pointer is the lock-free operation. If it succeeds, we have logically removed the node. If it fails (because another thread already changed the pointer), we simply retry. No thread is ever blocked.

2. Lock-Free Insertion

Insertion is a two-step process:

  1. Find: Use findNode to get the predecessors and successors for the new node at every level it will occupy.
  2. Link: Set the new node’s next[i] pointers to the successors. Then, use CAS to try and set the predecessor’s next[i] to the new node. This is done level-by-level, from bottom to top.
// Lock-Free Insert (simplified)
public boolean insert(K key, V value) {
    LockFreeNode<K, V>[] preds = (LockFreeNode<K, V>[]) new LockFreeNode[MAX_LEVEL];
    LockFreeNode<K, V>[] succs = (LockFreeNode<K, V>[]) new LockFreeNode[MAX_LEVEL];

    while (true) {
        // 1. Find the position
        int level = findNode(key, preds, succs);
        // If key already exists, update its value (not shown for brevity)
        // ...

        // 2. Create the node with a random height
        int newHeight = randomHeight();
        LockFreeNode<K, V> newNode = new LockFreeNode<>(key, value, newHeight);

        // 3. Link the node's next pointers to successors (bottom-up)
        for (int i = 0; i <= newHeight; i++) {
            newNode.next[i].set(succs[i], false);
        }

        // 4. Try to link the node into the skip list (bottom-up)
        // Must succeed at level 0 first.
        LockFreeNode<K, V> pred = preds[0];
        LockFreeNode<K, V> succ = succs[0];
        newNode.next[0].set(succ, false);
        if (pred.next[0].compareAndSet(succ, newNode, false, false)) {
            // Level 0 succeeded! Now try higher levels.
            for (int i = 1; i <= newHeight; i++) {
                while (true) {
                    pred = preds[i];
                    succ = succs[i];
                    if (pred.next[i].compareAndSet(succ, newNode, false, false)) {
                        break; // Success on this level
                    }
                    // If CAS fails, it means the predecessor was removed or changed.
                    // We simply fetch the new successor and retry the CAS.
                    succ = pred.next[i].getReference();
                }
            }
            return true;
        }
        // CAS at Level 0 failed. Another thread inserted a node here.
        // The whole loop restarts the search.
    }
}

The Critical Rule: You must link the bottom level first. If you link a high level first, a concurrent search might see the new node at the top level but then fail to find it at the bottom level, leading to a broken traversal path.

3. Lock-Free Deletion (The “Mark and Sweep”)

Deletion is the most challenging part. We cannot simply unlink the node because another thread might be traversing it. The solution is a two-phase protocol:

  1. Logical Deletion (Mark): We “mark” the node as deleted. This is done by atomically setting the marked flag in the node’s next[0] pointer using CAS. Once a node is logically deleted, no new threads will link new nodes around it. A concurrent search that encounters a marked node will attempt to “help” by physically removing it.

  2. Physical Deletion (Sweep): We physically remove the node from the list by CAS-ing the predecessor’s next[i] pointers to skip over the marked node. This is done level-by-level, from bottom to top. Again, the physical removal is not mandatory for correctness; the logical deletion is sufficient to prevent future access.

// Lock-Free Delete (simplified)
public boolean delete(K key) {
    LockFreeNode<K, V>[] preds = (LockFreeNode<K, V>[]) new LockFreeNode[MAX_LEVEL];
    LockFreeNode<K, V>[] succs = (LockFreeNode<K, V>[]) new LockFreeNode[MAX_LEVEL];
    LockFreeNode<K, V> target;

    while (true) {
        // 1. Find the node to delete
        int level = findNode(key, preds, succs);
        target = succs[0]; // The node we want to delete

        if (target == null || !target.key.equals(key)) {
            return false; // Key not found
        }

        // 2. Logical Deletion: Mark the node at Level 0.
        LockFreeNode<K, V> succ = target.next[0].getReference();
        if (!target.next[0].compareAndSet(succ, succ, false, true)) {
            continue; // Mark failed, retry entire process.
        }

        // 3. Physical Deletion (Sweep) - Bottom-up
        for (int i = 0; i <= level; i++) {
            LockFreeNode<K, V> pred = preds[i];
            LockFreeNode<K, V> curr = pred.next[i].getReference();
            while (curr != null && curr.marked) {
                // Help other threads by physically removing the marked node
                succ = curr.next[i].getReference();
                if (pred.next[i].compareAndSet(curr, succ, false, false)) {
                    curr = succ;
                } else {
                    // CAS failed, fetch the new current and retry
                    curr = pred.next[i].getReference();
                }
            }
        }
        return true;
    }
}

Why “Helping” Matters: The findNode function already attempts to clean up marked nodes. This means that even if a deletion thread is preempted after marking but before physical removal, a concurrent search thread will do the cleanup work. This is the essence of lock-freedom: the system makes progress even if any single thread is delayed.


Part IV: The Range Query Renaissance - Why Skip Lists Win

Now we arrive at the crown jewel of our design: the range query. A B-Tree, while excellent for point queries and range queries on a single node, becomes a nightmare for concurrent range queries. Traversing leaf nodes requires acquiring and releasing latches on multiple pages, leading to complex lock-coupling protocols. A lock-free B-Tree is a research paper in itself. But a skip list? It’s a linked list.

1. The Unbroken Chain

A lock-free skip list is, at its heart, a lock-free sorted linked list (Level 0). To perform a range query for keys in [start, end):

  1. Locate the Start: Perform a findNode(key=start). This gives you the predecessor/successor at Level 0. The successor is the first node >= start.
  2. Traverse Forward: Starting from this successor, simply follow the next[0] pointers. You are now walking a singly-linked list.
  3. Collect Results: For each node you visit, check if its key is < end. If yes, collect its value. If not, you are done.

Because this traversal is entirely on Level 0, it is incredibly cache-friendly and predictable. There are no jumps between nodes, no recursion, no complex loop invariants. It’s a simple while loop.

# Pseudocode for a Lock-Free Range Query
def range_query(self, start_key, end_key):
    """Returns all key-value pairs where start_key <= key < end_key."""
    preds = [None] * MAX_LEVEL
    succs = [None] * MAX_LEVEL
    # 1. Find the starting node (the successor of start_key at Level 0)
    level = self.find_node(start_key, preds, succs)
    current = succs[0] # The first node >= start_key

    # 2. Traverse the Level 0 list
    results = []
    data = self.data  # Reference to the atomic pointer to the node's value
    # Use a 'read-atomic' approach for value retrieval
    while current is not None and current.key < end_key:
        # Read the value atomically (list.value is an AtomicReference)
        value = current.value.get()
        results.append((current.key, value))
        # Move to the next node. Note: we ignore the 'marked' bit here!
        # If a node is marked, we still return its value if we saw it before it was marked.
        # This is a *weak* but correct isolation model.
        current = current.next[0].getReference()

    return results

2. The Isolation Problem: What Happens During a Concurrent Range Query?

This is where the trade-offs of lock-freedom become stark. A range query is a snapshot of a range in time. In a lock-based system, a range query can acquire a read lock, ensuring the range is consistent. In a lock-free system, what happens if:

  • A node is being logically deleted (marked) as we traverse? We will see it. The algorithm above will return its value. This is an example of a weak serializability or snapshot isolation. The query is not a perfect point-in-time snapshot, but it is a consistent state. The deleted node was present at the start of the traversal and still is not fully removed.
  • A new node is inserted before the current node? This is fine. The traversal will not see it because we are moving forward through the list. The new node will be in the list, but our traversal might miss it.
  • A node is inserted just after the current node? We will see it. The new node’s next[0] pointer is set before it is linked. When we call current = current.next[0].getReference(), we might land on the newly inserted node, or we might have already passed it. The result is deterministic based on the timing of the CAS operations.

The Golden Rule of Lock-Free Range Queries: A range query on a lock-free skip list is non-blocking and O(log n + k) where k is the number of results. It does not provide strict serializability (like a database transaction). It provides a form of eventual or causal consistency for the range. This is perfectly acceptable for many use cases (metrics, logs, time-series data) but unsuitable for financial transactions where a consistent snapshot is mandatory.


Part V: The Distributed Dimension - Scaling the Skip List

We have solved two out of the three-body problem (speed + concurrency). Now we must tackle distribution. How do we take our single-node lock-free skip list and split it across N machines?

The naive approach is to put a single skip list behind a load balancer behind a distributed lock. We’ve already discussed why this fails. We need to distribute the data itself.

1. The Sharding Strategy: A Distributed Red-Black Tree Analogy

The most practical approach is sharding (also known as partitioning). We partition the key space into N ranges. Each machine (shard) hosts an independent lock-free skip list responsible for a specific segment of the key-space.

  • Key Range: 0 - 1000: Shard 1
  • Key Range: 1000 - 2000: Shard 2
  • Key Range: 2000 - 3000: Shard 3

A client or a routing layer (e.g., a consistent hash ring) knows, for any key, which shard is responsible for it.

This solves the single-machine bottleneck. We can add more machines to handle more data or more requests. Each machine’s skip list is still lock-free internally. The routing layer is typically a simple hash map or a consistent hash ring (for dynamic re-sharding).

2. The Nightmare of Distributed Range Queries

Here is where our design faces its ultimate test.

Scenario: The user wants all keys between 500 and 2500.

  • Shard 1 can handle [500, 1000).
  • Shard 2 can handle [1000, 2000).
  • Shard 3 can handle [2000, 2500).

The Naive Approach: Query all 3 shards in turn, collect results, merge them. This is slow and complex. The client must know the shard boundaries.

The Fault-Tolerant Approach: Use a scatter-gather pattern. The client sends a “RangeQuery(start=500, end=2500)” request to a single coordinator (or any node in the cluster).

  1. Fragmentation: The coordinator uses its local knowledge of the shard map to determine which shards are involved: [Shard 1, Shard 2, Shard 3].
  2. Fan-out: The coordinator sends asynchronous Range Query requests to each of these shards.
  3. Gather: The coordinator waits for all responses. If a shard is down, it can either fail the query, try a retry, or return a partial result (depending on SLA).
  4. Merge: The responses come back as sorted lists of (key, value). The coordinator must merge these sorted lists into a single sorted list. This is an O(k * log N) operation using a min-heap (Priority Queue) where k is the total number of results.
// Pseudocode for a Distributed Range Query Coordinator
public List<KVEntry> distributedRangeQuery(int startKey, int endKey) {
    // 1. Identify involved shards and their key ranges
    Set<Shard> targetShards = shardMap.getShardsForRange(startKey, endKey);

    // 2. Fan-out to shards asynchronously
    List<CompletableFuture<List<KVEntry>>> futures = new ArrayList<>();
    for (Shard shard : targetShards) {
        // The shard is given its specific sub-range to query.
        // This prevents it from scanning all of its data.
        int shardStart = Math.max(startKey, shard.rangeStart);
        int shardEnd = Math.min(endKey, shard.rangeEnd);
        futures.add(shard.asyncRangeQuery(shardStart, shardEnd));
    }

    // 3. Gather all results (blocking until all complete)
    List<List<KVEntry>> shardResults = CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
            .thenApply(v -> futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList()))
            .get();

    // 4. Merge sorted lists using a min-heap
    PriorityQueue<IdxAndEntry> heap = new PriorityQueue<>(
            Comparator.comparingInt(a -> a.entry.key)
    );
    for (int i = 0; i < shardResults.size(); i++) {
        if (!shardResults.get(i).isEmpty()) {
            heap.offer(new IdxAndEntry(i, shardResults.get(i).get(0)));
        }
    }

    List<KVEntry> mergedResult = new ArrayList<>();
    while (!heap.isEmpty()) {
        IdxAndEntry smallest = heap.poll();
        mergedResult.add(smallest.entry);
        int shardIdx = smallest.shardIdx;
        // Move to the next element in the same shard's result list
        int nextIdx = shardIdx + 1;
        if (nextIdx < shardResults.get(shardIdx).size()) {
            heap.offer(new IdxAndEntry(shardIdx, shardResults.get(shardIdx).get(nextIdx)));
        }
    }
    return mergedResult;
}

The Cost: The distributed range query has a much higher latency than a local one (network round trips). The merge step is an overhead that does not exist in a single-node system.

3. The Masterpiece: A True Distributed Skip List

Can we do better? Can we create a single, truly distributed skip list where the pointers are network addresses (like remote procedure calls) instead of local memory addresses? This is the holy grail.

Architecture: Imagine a skip list spanning N machines. Each node is on a specific machine. The node’s next[i] pointer is not a local memory address but a (machine_id, address) tuple.

  • Search: A query starts at the head (on any machine). To traverse, it must perform an RPC to the next machine to fetch the next node’s key and its pointers. This is incredibly expensive.
  • Caching: You could cache remote pointers locally. This is the foundation of distributed B-Trees (like in Google Spanner, using Paxos for replication).
  • Challenge: This devolves into a distributed hash table (DHT) like Chord or Pastry. These systems are excellent for lookup, but terrible for range queries because keys are spread across nodes based on a random hash.

A true distributed skip list with range query support is an area of active research. The most practical solution today remains the sharded approach, where the range query is a cross-shard scatter-gather operation.


Part VI: Practical Implementation and Trade-offs

You have the theory. Now, how do you build this? Let’s look at some concrete considerations for a production implementation, using Rust or C++ for maximum performance.

1. Memory Management: The Epoch Problem

Lock-free data structures have a classic memory management problem: a thread might be reading a node while another thread is deleting it. If the deleting thread frees the memory for the deleted node, the reading thread will dereference a dangling pointer.

Solutions:

  • Epoch-Based Reclamation (EBR): The most common approach, used in many high-performance systems. A thread announces itself as being in a “read epoch.” It cannot exit the epoch until it has seen all pointer updates that were pending when it entered. This prevents memory from being freed while in use. Hazard Pointers are another variant.
  • Reference Counting with RCU: Read-Copy-Update (RCU) relies on a grace period and is heavily used in the Linux kernel. It is excellent for read-heavy workloads.

The Impact on Range Queries: During a range query, a thread is in a read epoch for the entire duration of the traversal. This prevents deletion of nodes we are about to traverse. This is a key performance consideration.

2. The Cost of CAS and False Sharing

  • CAS in a Loop: The while-loops in Insert and Delete are not free. A high-contention operation might retry hundreds of times, burning CPU cycles. This is the price of lock-freedom.
  • False Sharing: Nodes in a skip list are small. If two threads modify two different nodes that happen to be on the same cache line, they will still cause cache line invalidations for each other. This is a silent performance killer. The solution is padding nodes to align to cache line boundaries.

3. When Not to Use This

  • Need Strong Transactional Semantics: If you need ACID transactions over a range, you need a distributed transaction coordinator (like Percolator or Omid), not a lock-free skip list.
  • Write-Heavy Workloads with Small Keys: A simple sharded, non-distributed B-Tree on a single SSD can outperform a lock-free skip list for a single point query, especially if it can use hardware-accelerated writes (e.g., NVMe).
  • Small Datasets that Fit in L3 Cache: A simple sorted array with O(log n) binary search and copy-on-write can be faster than a skip list with pointer chasing.

Conclusion: The Path Forward

We began with the agony of the lock and the three-body problem of data infrastructure. We explored the lock-free skip list—a structure that cleverly uses probabilistic layering and atomic compare-and-swap operations to achieve high concurrency without the failure domain of locks. We saw how its simple Level 0 linked list makes it a star for range queries, solving the “blind spot” of most concurrent maps.

Finally, we faced the harsh reality of distribution. Sharding the lock-free skip list is a practical, battle-tested solution for scaling horizontally, but it turns a clean O(log n + k) local operation into a complex, multi-round-trip scatter-gather operation with a merge cost. The true, non-sharded distributed skip list remains a siren’s call for researchers.

The key takeaway is that no single data structure is a silver bullet. The lock-free skip list is a powerful tool in your arsenal, perfectly suited for in-memory, read-heavy, range-query-dominated workloads where latency is paramount and perfect consistency is a secondary concern. It is the silent workhorse behind many time-series databases, in-memory caches, and real-time analytics engines.

As you design your next high-performance system, ask yourself: What are my three variables? Is the trade-off of perfect consistency for unparalleled speed and fault tolerance a worthy one? For many, the answer is a resounding “yes,” and the journey begins with a single, lock-free step.

What have been your experiences with concurrent data structures? Have you ever implemented a lock-free structure in production? Share your war stories and insights in the comments below!