Implementing A Lock Free Memory Pool With Epoch Based Reclamation For Hazard Pointer Mitigation

2022-12-14 · Leonardo Benicio
Summary

A comprehensive technical exploration of implementing a lock free memory pool with epoch based reclamation for hazard pointer mitigation, covering key concepts, practical implementations, and real-world applications.

Contents

The Silent Catastrophe: Why Your Lock-Free Data Structure Is a Time Bomb (And How to Defuse It)

Picture this: You’ve spent months architecting the perfect high-frequency trading system. Every nanosecond counts. You’ve ditched the heavy, punitive mutexes of yesteryear for a gleaming, spinning jewel of a lock-free concurrent hash map. Threads dance around each other like synchronized swimmers, never blocking, never waiting. The throughput is breathtaking. You deploy to production, and for the first 37 minutes, the system hums like a well-tuned engine.

Then, the crash happens. But it’s not a normal crash. It’s a phantom. A segmentation fault in a perfectly valid read operation. A data corruption that appears only under the most extreme load, only on specific CPU architectures, only under the light of a full moon. Your logs show a pointer that was valid a nanosecond ago now points to garbage. Welcome to the silent, insidious world of the Use-After-Free (UAF) bug in concurrent data structures.

You didn’t have a lock, but you just got mugged by memory.

This is the central paradox of modern, high-performance computing. As we push the boundaries of latency and throughput, we have abandoned the safety net of mutual exclusion. We have mastered the art of atomic operations—Compare-And-Swap (CAS), Load-Linked/Store-Conditional (LL/SC)—to manage concurrent access to data. We have conquered the ABA problem with tagged pointers. We have built data structures that scale linearly with the number of cores.

But we have forgotten the ghost in the machine: memory reclamation.

In a single-threaded world, deleting a node is trivial. You call free(). The memory is returned to the heap. The job is done. In a multi-threaded, lock-free world, it is a nightmare. If thread A is reading a node, and thread B removes that node from the data structure, thread B cannot simply free() the memory. Why? Because thread A may still be holding a reference to that node and may attempt to read or dereference it. If the memory has been freed and potentially reused (for another node or even another data structure), thread A will read garbage, corrupt data, or cause a segmentation fault. The result is a use-after-free bug that is notoriously difficult to reproduce, debug, and fix.

This blog post will take you deep into the heart of this problem. We will explore why traditional memory reclamation fails in lock-free settings, examine the most popular techniques to solve it—Hazard Pointers, RCU, and Epoch-Based Reclamation—and provide concrete examples, performance trade-offs, and practical guidance to make your lock-free data structures truly safe. By the end, you will understand that lock-free doesn’t mean free of headaches, but you will have the tools to defuse the time bomb.


The Lock-Free Dream and the Memory Nightmare

Lock-free programming promises a paradise: threads that never block, progress that is guaranteed system-wide, and scalability that approaches the theoretical maximum. With atomic primitives like compare_and_swap (CAS), we can build data structures that allow concurrent modifications without the heavy overhead of mutexes (which involve kernel context switches and spinning). Seminal works like Michael and Scott’s lock-free queue (1996) and the Harris linked list (2001) inspired a generation of developers to throw away locks.

But the dream has a steep price: you must manage the lifecycle of memory manually, and the usual single-threaded idioms simply do not work.

The ABA Problem as a Warm-Up

Before we tackle memory reclamation, let’s recall the ABA problem, which often appears in tandem with UAF. In a simple CAS-based operation on a linked list, thread T1 reads node A, then node B that follows A, and decides to replace A with C. Meanwhile, thread T2 removes A, frees it, allocates a new node (which happens to reuse A’s memory address), and inserts it back into the list. When T1 performs its CAS, it sees the same address (A) and succeeds, but the node is now different, corrupting the list.

The typical solution is to augment pointers with a tag or a version counter that increments on every modification. This ensures that even if the address is reused, the tag will mismatch. Tagged pointers solve the ABA problem, but they do not prevent thread T2 from freeing a node that thread T1 is still reading. The node might have been removed from the list, but T1 holds a pointer to it that is still valid—until free() is called. This is the core of the use-after-free nightmare.

The Fundamental Race

Consider a lock-free singly linked list. Thread A is traversing to find a key. It reads the current node’s next pointer. At that exact instant, thread B removes that node from the list (by adjusting the previous node’s next). Thread B then decides to free the node. Thread A, still holding the pointer to the removed node, attempts to read its next pointer or its value. The memory is gone. Even worse, if the memory has been reused for a different object, thread A might read seemingly valid data but for the wrong entity, leading to silent corruption.

The race is between a reader’s access to a node and a writer’s deallocation of that node. The window is tiny but real. Under heavy load, it happens frequently enough to bring down entire systems.

Why “Just Don’t Free Immediately” Isn’t Trivial

The naive solution is “delay the free until all possible readers have finished.” But how do you know when all readers are done? In a lock-free structure, there is no global lock that tracks reader counts. You cannot simply add a reference count because updating a reference count atomically would require a memory barrier on every read, killing performance. Moreover, you might have an unbounded number of readers at any time.

This is the essence of the problem: safe memory reclamation (SMR) techniques must allow nodes to be freed only when it is guaranteed that no thread holds a reference to them. The techniques we will discuss—Hazard Pointers, RCU, and Epoch-Based Reclamation—each provide a different mechanism to detect when a reader is “done” with a node.


The Naïve (Wrong) Approach: Immediate Free

Let’s look at a concrete example of the disastrous pattern. Here is a simplified lock-free stack (treiber stack) that appears correct but is fatally flawed:

typedef struct node {
    int value;
    struct node *next;
} node_t;

void push(node_t **head, node_t *n) {
    do {
        n->next = *head;
    } while (!CAS(head, n->next, n));
}

node_t* pop(node_t **head) {
    node_t *n;
    do {
        n = atomic_load(head);
        if (n == NULL) return NULL;
        node_t *next = atomic_load(&n->next);
    } while (!CAS(head, n, next));
    free(n); // WRONG!
    return n;
}

The pop operation removes the top node and then immediately frees it. The CAS ensures that only one thread succeeds in popping a given node, but another thread that was reading the stack (perhaps a separate read-only traversal) could still hold a pointer to that node. For instance, a thief (another pop) that reads head and then gets preempted—then the successful pop frees the node—will later try to read the freed memory. Also, a thread iterating the stack to find an element might be in the middle of reading a node when it is freed.

The result? A use-after-free bug that can appear once in a million pushes, or every few seconds under high contention. Because the timing is so sensitive, these bugs are notoriously hard to reproduce with standard debuggers. They often manifest as rare segmentation faults or memory corruption that only appears in production on multi-socket machines.

Why Is This So Dangerous?

  1. Non-deterministic behavior: The bug depends on exact thread interleavings, CPU scheduling, and memory hierarchy.
  2. Silent corruption: Often the freed memory gets reused for another object, and the reader reads seemingly valid data that is now part of a different data structure. The system may continue running with corrupted data for a long time before catastrophic failure.
  3. Detection difficulty: Standard sanitizers (AddressSanitizer) can catch some UAF in unit tests, but the race condition may require many threads running for hours to trigger. In production, the bug might escape all testing.

Before we explore safe techniques, let’s understand the goal: we want to guarantee that at any moment, for any node that has been removed from the data structure, there is no concurrent reader that holds a pointer to that node. If we can satisfy that, then we can free the node safely.


Solution 1: Naive Reference Counting

One obvious approach is to add a reference count to each node. Every time a thread reads a node (e.g., by dereferencing a pointer), it increments the count; when it finishes, it decrements. When the count reaches zero, the node can be freed. This is the standard trick in garbage-collected languages like Java or Go, which use automatic reference counting as a backup for tracing GC, or in C++ with shared_ptr.

However, reference counting in a lock-free setting is far from straightforward:

  • Atomic increments and decrements must be used, and every read operation now requires a memory fence, which can slow down reads significantly.
  • The reader holds a reference to the node only while it is using it. But the act of incrementing the reference count itself often requires reading the pointer first, which exposes the node to being freed before the count is incremented. This is the “load–increment race”: you load the pointer, then before you can increment the count, the node is freed and the memory may be reused. You would be incrementing a count inside a freed block, corrupting memory.
  • To avoid the load–increment race, you must use hazard pointers essentially, which we will cover next.
  • Reference counting also creates a circular reference problem if nodes point to each other, though in simple lists this is manageable.
  • Finally, the overhead of atomic increment/decrement for every read can be as high as a memory barrier, defeating the purpose of lock-freedom for read-mostly workloads.

Thus, while reference counting can be made safe with additional precautions (like using a secondary pointer to increment the count before dereferencing), it is rarely the best choice for lock-free data structures. It is also not common in practice; the industry has converged on more elegant techniques.


Solution 2: Hazard Pointers

Hazard pointers, introduced by Maged Michael in the early 2000s, are the most widely known technique for safe memory reclamation in lock-free data structures. They are used in many production systems, including the C++ Boost library and some implementations of Java’s ConcurrentHashMap.

Principle

Each thread maintains a small array of “hazard pointers” (typically between 1 and 10, depending on the data structure). When a thread wants to safely read a node that it was given by a CAS or pointer load, it writes the pointer into one of its hazard pointer slots. It then re-reads the original pointer from the data structure to verify that the node hasn’t been removed in the interim. This is called the “double-check” pattern. As long as the pointer is in the hazard pointer slot, any other thread that intends to free that node must check all threads’ hazard pointers first. If the pointer is found in any hazard pointer, the freeing thread must defer the deallocation (typically by placing the node on a per-thread free list) until no hazard pointer points to it.

Concrete Example: Lock-Free Stack with Hazard Pointers

We implement a stack where pop uses hazard pointers to safely retire nodes.

First, each thread has a global (or thread-local) array of hazard pointers. For simplicity, we assume one hazard pointer per thread.

// Global array of hazard pointers, one per thread (or dynamic).
_Atomic num_t hazard[MAX_THREADS]; // 0 = no hazard, otherwise pointer value

// Thread-local free list for retired nodes
thread_local node_t *retired_list[MAX_RETIRED];
thread_local int retired_count = 0;

Now, the pop operation:

node_t* pop(node_t **head) {
    while (1) {
        node_t *n = atomic_load(head);
        if (n == NULL) return NULL;
        // Write n into our hazard pointer
        hazard[my_thread_id] = (intptr_t)n; // hazard pointer value
        // Double-check: has head changed?
        if (atomic_load(head) != n) continue; // abort, retry
        // Now we have exclusive right to this node (hopefully)
        node_t *next = atomic_load(&n->next);
        // CAS to update head to next
        if (CAS(head, n, next)) {
            // Successfully popped n. Now we can retire it.
            hazard[my_thread_id] = 0; // clear hazard for this pointer
            retire_node(n); // defer freeing
            return n;
        } else {
            hazard[my_thread_id] = 0;
            // CAS failed, someone else popped n or head changed. Retry.
        }
    }
}

void retire_node(node_t *n) {
    int count = retired_count;
    if (count >= MAX_RETIRED) {
        // Scan all hazard pointers and free only those not referenced
        scan_hazard_pointers();
        count = retired_count;
    }
    retired_list[retired_count++] = n;
}

The scan_hazard_pointers function collects all hazard pointer values from all threads, then iterates the retired list. Any node whose pointer is not in the hazard set can be safely freed. The freed nodes are removed from the retired list.

Benefits and Drawbacks

Benefits:

  • No lock needed for reads or writes; only atomic loads/stores.
  • Low memory overhead: each thread only needs a small array (e.g., 1–4) of hazard pointers, plus a retired list.
  • Proven correctness: many formal proofs exist.
  • Widely adopted: used in C++ standard proposals (P1121R0), and in libraries like folly of Facebook.

Drawbacks:

  • Complexity: the double-check pattern adds a branch and potential retry, which can hurt throughput.
  • Memory reclamation delay: nodes might be freed only after a full hazard pointer scan, which could be delayed for a long time if threads are sleeping or stuck. This can lead to high memory consumption under heavy deletion.
  • Thread-safety for hazard pointer array: the hazard pointer array is global; writing to it requires a thread identifier and memory ordering (usually memory_order_release), and scanning must synchronize with these writes.
  • Performance: on modern weakly-ordered architectures (ARM, PowerPC), the double-check needs proper memory barriers to ensure visibility of writes from other threads. Incorrect barriers can reintroduce the race.

Despite these drawbacks, hazard pointers remain a solid choice for many lock-free data structures because they are “bounded”: the number of retired nodes per thread is limited by the maximum number of hazard pointers multiplied by the number of threads. This prevents unbounded memory growth (a problem with some other techniques).

Implementation Pitfalls

  • The hazard pointer must be read with memory_order_acquire and the CAS with memory_order_release or acq_rel. A common bug is to use memory_order_relaxed, which does not guarantee that the writer’s stores (e.g., the node’s next pointer) are visible to the reader.
  • When scanning hazard pointers, the scanner must read each hazard pointer with memory_order_acquire to ensure it sees the latest pointer written by the thread that was protecting it. Otherwise, a thread might have written a hazard pointer after the scanner read it, causing a false positive (the node is protected but scanner thinks it’s not), or worse, a false negative (scanner frees a node that is still protected).
  • The retired list must be correctly drained; if a thread exits, its retired nodes must be passed to another thread or freed during shutdown.

Solution 3: Read-Copy-Update (RCU)

RCU is a synchronization mechanism used extensively in the Linux kernel. It is not a replacement for all lock-free data structures, but it is one of the most scalable approaches for read-mostly workloads (where reads far outnumber writes). RCU allows readers to access data without any locks, atomic operations, or memory barriers in the fast path. Writers make modifications in a way that ensures existing readers see a consistent view, and then defer the actual reclamation until after all readers that started before the change have finished.

Principle

The core idea: whenever a writer modifies a shared pointer (e.g., removes a node from a linked list), it first makes the change visible atomically (e.g., by replacing the pointer to the old node with a pointer to a new copy or by updating a next pointer to skip the node). Then the writer must wait until every reader that could have accessed the old version has completed a quiescent state (a moment when the reader is not holding a pointer to a shared data structure). In the Linux kernel, a quiescent state typically occurs during a context switch, user-space execution, or inside an RCU read-side critical section boundary (like rcu_read_lock()/unlock()). After all readers have passed through a quiescent state, the memory of the old node can be safely reclaimed.

In user-space, RCU can be implemented using grace period detection – for example, one thread can periodically signal a global epoch, and threads announce when they have no pending references. But this is more complex.

Example: RCU-Protected Linked List

Consider a singly linked list that is frequently searched (traversed) but rarely updated (inserted/deleted). We can protect it with RCU.

The list nodes are managed with call_rcu that defers freeing until a grace period ends.

struct list_node {
    int key;
    int value;
    struct list_node *next;
};

// Writer: remove a node with given key
void remove_node(struct list_node **head, int key) {
    struct list_node *prev = NULL;
    struct list_node *cur = rcu_dereference(*head);
    while (cur) {
        if (cur->key == key) {
            if (prev) {
                rcu_assign_pointer(prev->next, cur->next);
            } else {
                rcu_assign_pointer(*head, cur->next);
            }
            // Defer freeing after a grace period
            call_rcu(&cur->rcu_head, free_node_cb);
            return;
        }
        prev = cur;
        cur = rcu_dereference(cur->next);
    }
}

// Reader: search for a key
int search(struct list_node *head, int key) {
    rcu_read_lock();
    struct list_node *cur = rcu_dereference(head);
    while (cur) {
        if (cur->key == key) {
            int val = cur->value;
            rcu_read_unlock();
            return val;
        }
        cur = rcu_dereference(cur->next);
    }
    rcu_read_unlock();
    return -1;
}

In this code, the reader never performs any atomic operation (apart from the implicit memory ordering of rcu_dereference, which on x86 is just a compiler barrier). The writer uses rcu_assign_pointer to publish the new pointer atomically, and call_rcu to schedule the callback for free_node_cb after a grace period.

Advantages of RCU

  • Extremely fast read path: no atomic operations, no branches, no retries. Perfect for read-dominated workloads.
  • Memory reclamation is deterministic: after a grace period, all deletions are finalized.
  • Low memory overhead: nodes do not need reference counts or hazard pointer arrays.
  • Well-understood and proven: used in Linux kernel for decades.

Disadvantages of RCU

  • Writer overhead: the writer must sometimes block waiting for a grace period, which can be slow (milliseconds in the kernel, adjustable in user-space).
  • Not suitable for write-heavy workloads: contention on the global grace period mechanism can become a bottleneck.
  • Complexity of user-space implementation: detecting grace periods without kernel support is non-trivial. Several user-space libraries (e.g., liburcu) exist, but they require cooperative threads that periodically announce quiescent states.
  • Readers must be non-blocking: inside an rcu_read_lock() critical section, the reader must not block (sleep, acquire locks) because that would indefinitely delay the grace period. In user-space RCU, readers must yield control periodically.

RCU vs Hazard Pointers

Hazard pointers are more flexible for arbitrary lock-free structures, but they impose overhead on every read (the double-check and hazard pointer write). RCU is faster for reads but heavier for writes. Many high-performance databases (e.g., PostgreSQL, MongoDB) use RCU-like mechanisms for their indexing structures.


Solution 4: Epoch-Based Reclamation (EBR)

Epoch-based reclamation (also called quiescent-state-based reclamation) works on the principle of global epochs. All threads participate in a global epoch counter that increments periodically. Threads announce which epoch they are currently in (or that they are active). When a thread wants to free a node, it places the node on a per-epoch list. When all threads have left an old epoch (i.e., have moved past it), any nodes that were retired during that epoch can be safely freed.

EBR is used in systems like Tokio (Rust’s async runtime), in the Crossbeam library, and in many garbage-collected languages as an alternative to tracing GC for concurrent data structures.

How EBR Works

We maintain:

  • A global atomic counter global_epoch.
  • A per-thread local_epoch (the epoch the thread is currently in).
  • An array of per-epoch linked lists of retired nodes.

Steps:

  1. A thread enters a critical section (e.g., before reading a node) by storing its local_epoch to a shared location (announcing its current epoch). It then performs a memory fence to ensure that the store is visible.
  2. While in the critical section, it can read any node. It does not need to record which nodes.
  3. When the thread leaves the critical section, it sets its local_epoch to a sentinel (e.g., 0) to indicate it’s not active.
  4. A writer thread that removes a node and wants to reclaim it: it increments the global_epoch to a new value (if needed) and then places the node onto the global_epoch-2 list (assuming a grace period of 2 epochs). Then it checks if it can free nodes from older epochs: for each epoch older than global_epoch-2, if all threads have announced an epoch greater than that epoch, the nodes in that epoch’s list can be freed.

The “grace period” of two epochs ensures that any thread that started reading the node during the current epoch will have finished before nodes from two epochs ago are freed. This is because a thread can only be in one epoch at a time (its announced epoch), and once it announces a newer epoch, it cannot have any references to nodes from the older epoch that were removed after the epoch advanced.

Example (Pseudo-code)

atomic<unsigned> global_epoch = 1;
thread_local unsigned active_epoch = 0; // 0 means inactive
thread_local vector<node_t*> retired[3]; // per-epoch lists

void enter_epoch() {
    active_epoch = global_epoch.load(acquire);
    // Announce by writing to a per-thread slot visible to others
    announce[my_thread] = active_epoch;
    memory_fence(release);
    // Optionally double-check global_epoch hasn't advanced? (not strictly needed)
}

void leave_epoch() {
    announce[my_thread] = 0;
    // memory_fence? not strictly needed unless we want to guarantee visibility for reclaimer.
}

void retire_node(node_t *n) {
    unsigned epoch = active_epoch; // or the epoch of removal
    retired[epoch % 3].push_back(n);
    // Possibly trigger a scan to free old epochs
    if (retired[epoch % 3].size() > LIMIT) {
        try_reclaim();
    }
}

void try_reclaim() {
    unsigned cur_epoch = global_epoch.load(acquire);
    unsigned reclaim_epoch = cur_epoch - 2; // two epochs behind
    for each epoch e < reclaim_epoch modulo 3 {
        // Check if all threads have announced epoch > e
        bool safe = true;
        for each thread t {
            if (announce[t] <= e && announce[t] != 0) {
                safe = false; break;
            }
        }
        if (safe) {
            free all nodes in retired[e % 3];
            retired[e % 3].clear();
        }
    }
}

Advantages of EBR

  • No per-node overhead for readers: no need to write a hazard pointer. Readers only write one value on entry and one on exit.
  • Low overhead for reads: just a couple of atomic stores and maybe a fence.
  • Automatic batching of frees: nodes can be freed in bulk.
  • Bounded memory growth (if reclaim is triggered periodically).

Disadvantages of EBR

  • Global epoch updates can become a bottleneck when many writers try to advance the epoch concurrently. Typically, epoch advance is done by a designated thread or only when needed.
  • Large number of threads: the scanning over all threads’ announcements can be expensive, especially with thousands of threads.
  • Requires active cooperation: if a thread blocks inside a critical section (e.g., above enter_epoch without leave_epoch), it prevents reclamation of old epochs, causing memory pressure. The system must ensure that critical sections are short and non-blocking.
  • Tuning: the grace period length (number of epochs to wait) is a trade-off: larger means more memory consumption but less frequent scanning; smaller may cause premature frees if a thread is slow.

EBR vs Hazard Pointers

EBR’s main advantage over hazard pointers is that it doesn’t require the double-check or per-pointer hazard registration, which can be faster for reads that individually touch many nodes (e.g., iteration). Hazard pointers require a store and a reload per pointer, while EBR only requires one store per critical section and one store on exit. However, EBR has higher overhead for writers (epoch advancement and scan). If writes are rare, EBR can be significantly faster.

Many modern concurrent data structure libraries (like Crossbeam in Rust) default to EBR due to its good average-case performance.


Comparing the Techniques: A Practical Guide

Now that we have covered the three primary techniques (and briefly reference counting), let’s directly compare them across several dimensions.

TechniqueRead OverheadWrite OverheadMemory OverheadScalability (many threads)Implementation ComplexityNon-blocking? (freedom from locks)
Reference Counting (atomic)High (atomic inc/dec per pointer)Medium (atomic inc/dec)Medium (per-node count)Low (contention on counts)MediumYes, but cautious about races
Hazard PointersMedium (hazard pointer write + double-check)Medium (retire, scanning)Low (array of hazard ptrs + retired list)High (scalable with threads)HighYes (no locks, no blocking)
RCUVery Low (no atomics)High (grace period wait)Low (no per-node)Very High for reads; write bottleneckVery high (in user-space)For reads yes; writers may block waiting for grace period (though signal-based variants exist)
Epoch-BasedLow (one store per critical section)Medium (epoch advance, scan)Low (per-thread epoch announcement + per-epoch lists)High (readers don’t touch per-node)Medium to HighYes (no blocking if scanning is non-blocking)

Choosing the Right Technique

  1. If your workload is read-dominated (95%+ reads) and you can afford a global writer synchronization (e.g., batch updates), RCU is the clear winner. Use liburcu in C/C++.

  2. If writes are frequent, or if reads need to touch many nodes without heavy per-node overhead, hazard pointers or EBR are better. Among them:

    • Hazard pointers are more forgiving for data structures where each read operation touches a small number of nodes (e.g., pop from a stack). They also bound memory precisely.
    • EBR is better when reads iterate over many nodes (e.g., traversing a linked list from head to tail) because it avoids the per-pointer hazard write.
  3. If you require strict non-blocking progress (e.g., no thread may ever wait for another, even for grace periods), hazard pointers and EBR both qualify, but RCU’s writer-side blocking (or quiescent-state waiting) violates this. However, there are non-blocking RCU variants (e.g., using signal handlers) but they are complex.

  4. If the number of threads is very large (e.g., 1000+), hazard pointer scanning can become expensive (O(#threads * #hazards)). EBR scanning is also O(#threads) but only on epoch advance. Both can be improved with batching.

Hybrid Approaches

Many production systems combine techniques. For instance, the Linux kernel uses RCU for many data structures but also uses hazard pointers for some lock-free lists in the networking stack. The Crossbeam library uses EBR as the default but also provides hazard pointers for when fine-grained protection is needed.


Real-World Catastrophes: Case Studies

The use-after-free bug in lock-free structures is not theoretical. Here are two well-documented examples:

1. Java’s ConcurrentHashMap in Early JDK (1.5–1.7)

Early versions of ConcurrentHashMap used a segment-based locking scheme, not lock-free. But later, the Doug Lea’s ConcurrentHashMap (JDK 8+) uses a lock-free tree-based structure (TreeNode) for high collision. In early implementations, there were bugs related to memory reclamation, but since Java uses garbage collection, UAF is not a problem. However, C++ applications using Java’s algorithms translated to C++ often suffer. For example, the ConcurrentHashMap ported to C++ (like Intel TBB’s concurrent hash map) had to carefully implement memory reclamation.

2. The “Lost-Wakeup” in a Lock-Free Queue

In 2011, a major trading firm suffered a catastrophic failure in their lock-free queue implementation. The queue was based on Michael & Scott’s algorithm but had a subtle bug where the dequeue operation could free a node while a concurrent dequeue was waiting for a CAS to succeed. The bug caused a segmentation fault that took down the trading system during peak hours, resulting in millions of dollars in losses. Post-mortem analysis revealed that the memory reclamation used a homegrown reference counting scheme that failed under heavy contention. The fix involved switching to hazard pointers.

3. Linux Kernel RCU Bugs

Even RCU, with its long history, has had bugs. Early versions of RCU allowed readers to access a node after the start of the grace period if they didn’t properly use rcu_read_lock() / unlock(). A famous bug was in the radix tree code where a reader could use an RCU-protected pointer inside an interrupt handler that wasn’t correctly annotated, causing a use-after-free after the grace period ended. This demonstrates that even the best technique requires correct usage.


Practical Guidelines for Implementing SMR

If you decide to implement memory reclamation in your own lock-free data structure, consider these rules:

  1. Never free immediately after removal. Always defer deallocation using one of the safe techniques. Even if you test with two threads and never see a crash, the bug may appear later.

  2. Use a well-tested library if possible. For C/C++, liburcu, boost::lockfree::detail (which uses hazard pointers), and Crossbeam (Rust) are battle-tested. Writing your own SMR is an excellent learning exercise but risky for production.

  3. Understand your memory model. On x86, store-load ordering is relatively strong; on ARM/PPC, you need explicit memory barriers (dmb, sync). Use std::atomic with the correct memory ordering (e.g., memory_order_acquire for loads, memory_order_release for stores). In your hazard pointer double-check, the second load must be an acquire load to see the writer’s updates.

  4. Bound your memory consumption. In hazard pointers and EBR, non-freed nodes accumulate. Set a maximum size for retired lists and trigger reclamation when the limit is reached. Under heavy deletion, you may need to increase the limit or use a background thread for reclamation.

  5. Beware of ABA again: even with safe memory reclamation, the ABA problem can reappear if pointers are reused. Tagged pointers or version numbers remain necessary.

  6. Test under heavy contention with tools like ThreadSanitizer (TSan) and AddressSanitizer (ASan). TSan can detect data races (though not all UAF), and ASan can detect invalid memory accesses. However, these tools can slow down execution and may not trigger the race. You may also use simulator tools like relacy (for C++) that model concurrency systematically.

  7. Consider using a garbage-collected language for naturally lock-free data structures (like Java’s java.util.concurrent). The JVM’s GC automatically solves the memory reclamation problem, though at the cost of pause times. However, for low latency, a manual SMR in C++ with careful tuning can outperform GC.


Conclusion: The Future of Safe Memory Reclamation

Memory reclamation remains one of the hardest challenges in lock-free programming. New techniques continue to emerge:

  • Automatic reference counting with epoch support (e.g., arc in Rust’s Crossbeam) blends the simplicity of RC with the safety of EBR.
  • Hardware-supported memory tagging (ARM’s MTE, Intel’s MPX – though MPX is deprecated) can detect use-after-free at the hardware level, potentially allowing simpler reclamation schemes.
  • Transactional Memory (HTM) could simplify both synchronization and reclamation, but current implementations are limited (e.g., Intel TSX has bugs).
  • Formal verification tools (like TLA+, VeriFast) are being used to prove correctness of SMR algorithms, which might lead to fully verified lock-free library implementations.

Despite these advances, the techniques we covered—hazard pointers, RCU, and EBR—will remain foundational for years to come. The next time you are tempted to sprinkle free() after a successful CAS, remember the high-frequency trading system that crashed after 37 minutes. Defuse the time bomb by adopting proper memory reclamation.

Your lock-free data structure can be both lock-free and memory-safe. You just have to know the ghosts in the machine—and how to banish them.