Designing A Multithreaded Memory Allotter: Thread Caching Malloc (like Tcmalloc)

2026-04-18 · Leonardo Benicio
Summary

A comprehensive technical exploration of designing a multithreaded memory allotter: thread caching malloc (like tcmalloc), covering key concepts, practical implementations, and real-world applications.

Contents

The Hidden War at the Heart of Your Multicore Application

The hum of the data center is a lie. Beneath the placid blinking of LEDs and the whisper of cooling fans, a silent war is being waged—a war for a few hundred nanoseconds. In the modern, multi-core era, the performance of your application is no longer solely determined by the brilliance of your algorithms or the speed of your CPU. It is increasingly dictated by one of the most overlooked, yet intimately involved, pieces of system software: the humble memory allocator.

You have likely experienced its effects without realizing it. That web server that inexplicably stalls under high load, the database that does not scale linearly with the number of cores, the game engine that stutters on a powerful processor—the root cause is often not a bug in your logic, but a chokehold at the hardware and operating system level. The moment your code calls malloc() or new, you are entering a contested intersection where multiple threads, multiple cores, and multiple cache lines converge. A naive allocator turns this intersection into a traffic jam, forcing every thread to slow down and wait for a single, centralized gatekeeper. This is the fundamental problem we must solve.

The Hidden Cost of malloc: A Journey into the Kernel

To understand why a new approach is necessary, we must first appreciate the gravity of what malloc actually does. It is not a simple “give me a block of memory” operation. Behind the scenes, the default malloc (often glibc’s ptmalloc on Linux) must manage a complex geometry of free lists, arenas, and fences. It must satisfy allocation requests of arbitrary sizes, coalesce freed blocks back into larger chunks, and occasionally ask the kernel for more memory via the mmap() or sbrk() system calls. These system calls are expensive. A single trip to the kernel involves a context switch, a change in privilege level, and a search of the virtual memory maps—an operation that can take thousands of cycles. To avoid this, a good allocator tries to reuse freed memory.

But let’s peel back the layers a bit more. When you call malloc(128), a typical allocator first checks a thread-local cache or a per-core slab. If a block of the right size is available, it returns immediately—this is the fast path, often costing only a handful of instructions. If not, it must acquire a global lock to search the central free lists, which might involve scanning through a data structure of free chunks, splitting a larger block, or merging adjacent freed blocks (coalescing). This lock acquisition is the beginning of the traffic jam. In a single-threaded environment, it’s fine. In a multi-threaded application, however, every memory allocation and deallocation becomes a contention point.

The Anatomy of a Heap

Let’s visualize a typical heap layout. The allocator maintains a collection of “bins”—linked lists of free blocks of similar sizes. For small allocations (say, up to 512 bytes), these bins are often implemented as fast bins (singly linked lists, LIFO). For medium sizes, there are unsorted bins, small bins, and large bins (in ptmalloc terminology). When you free a block, the allocator might place it into the appropriate bin, but not without holding a lock to prevent concurrent modifications.

Consider a simple scenario with two threads both doing malloc and free on a shared heap. On a dual-core machine, one core may hold the lock while the other spins, wasting CPU cycles. Worse, the cache line containing the lock metadata is constantly bouncing between the two cores (cache coherence traffic). This is not merely an inconvenience; it can degrade performance by an order of magnitude in pathological cases.

The Multicore Traffic Jam: Contention and False Sharing

The problem of contention can be illustrated with a classic microbenchmark. Imagine a server application with 64 threads, each performing 1 million allocations and deallocations of a 64-byte block. Using ptmalloc (the default on Linux), the total time can be dozens of seconds. Switching to a scalable allocator like jemalloc or tcmalloc reduces that time to a fraction. The difference is entirely due to lock contention and cache line bouncing.

But contention is only half the story. Even when the allocator is not the bottleneck, false sharing can silently throttle performance. False sharing occurs when two different data structures used by different threads happen to reside on the same cache line (typically 64 bytes). When one thread writes to its portion, the entire cache line is invalidated on the other core, forcing a costly memory fetch. A poorly designed allocator can unintentionally exacerbate this by co-locating metadata and user data, or by placing very small allocations from different threads on the same cache line.

How Allocators Contribute to False Sharing

Think about the internal data structures of an allocator: free list pointers, bin headers, arena descriptors. These are often stored in static or global arrays. If a thread on core 0 modifies a free list node while core 1 is about to read another node in the same cache line, the hardware must invalidate that line. Even if the two nodes are logically separate, the physical proximity on the DRAM row (or in the cache line) creates interference.

Modern scalable allocators mitigate this by using per-thread heaps (or “arenas”) and ensuring that metadata from different threads is placed on distinct cache lines—often by padding or aligning structures to 64-byte boundaries.

ptmalloc: The Glibc Workhorse with Flaws

Let’s examine the default Linux allocator, ptmalloc2 (Pthreads malloc), which is derived from Doug Lea’s dlmalloc and then modified for multi-threading by Wolfram Gloger. The key idea was to use multiple “arenas”: each arena is a separate heap with its own mutex. When a thread calls malloc, it is assigned (or chooses) an arena. If the arena’s lock is contended, the thread may try to create a new arena, up to a limit (typically 2× number of cores). This reduces but does not eliminate contention.

Furthermore, glibc 2.26+ introduced a per-thread cache (tcache), a thread-local pool of small blocks (up to about 1000 bytes). This dramatically speeds up single-threaded allocation and reduces lock acquisition for small objects. However, the tcache is not a panacea: it still relies on the global arenas for medium and large allocations, and the “bins” are still global. Moreover, the tcache can cause increased memory fragmentation because freed blocks are kept in thread-local caches and are not immediately available to other threads—leading to a situation where one thread holds many free blocks while another thread starves and has to request more memory from the kernel.

Real-World Impact

Consider a web server like Apache or Nginx (though Nginx uses its own slab allocator). Under high concurrency, each request thread frequently allocates and deallocates small buffers. With ptmalloc, contention on the arena locks can become a bottleneck, causing requests to queue up on memory allocation. Developers have often observed that reducing the number of worker threads (while increasing the number of I/O threads) can actually improve throughput—partly because fewer threads reduce lock contention in malloc.

Modern Scalable Allocators: jemalloc, tcmalloc, mimalloc

To address these shortcomings, several high-performance allocators have been developed. Each takes a different approach but shares common principles: minimize global locks, use thread-local storage for frequent allocations, and design data structures to reduce cache line contention.

jemalloc – The FreeBSD and Firefox Backend

Originally developed by Jason Evans for FreeBSD and later adopted by Firefox, jemalloc is renowned for its scalability and low fragmentation. Its architecture is based on “arenas” (similar to ptmalloc but more fine-grained) and “runs”. Each thread is mapped to a specific arena (or set of arenas) to reduce contention. jemalloc also uses “tcache” (thread-local cache) for small allocations, but manages larger allocations in “regions” and “chunks”.

One notable feature is metadata sharing and separation. Metadata is stored in a way that avoids false sharing. Additionally, jemalloc performs lazy coalescing and buddy allocation to control fragmentation. It also provides a profiler and facilities to tune the number of arenas per CPU.

Example: Configuring jemalloc for a database server

MALLOC_CONF=background_thread:true,abort_conf:true ./my_database_server

The background_thread option spawns a background thread to perform deferred operations like purging unused pages, which can increase throughput under heavy allocation workloads.

tcmalloc – Google’s Thread-Caching Allocator

tcmalloc (Thread-Caching Malloc) divides memory into per-thread caches for small objects (up to 32 KB). For larger objects, it uses a global heap protected by a spinlock. Each thread maintains a cache of free objects of various sizes, mimicking a centralized free list but in local storage. When the thread cache is exhausted, the thread grabs a batch of objects from the central free list. When the thread cache gets too full, it releases a batch back to the central heap.

The key advantage is that most allocations (up to 90% in typical workloads) are satisfied without any lock acquisition. The central heap is only accessed when the thread cache underflows or overflows, and that requires only a single lock acquisition per batch.

tcmalloc also includes features like page heap management and sampling profiler to help developers understand allocation patterns.

Code snippet: Using tcmalloc via LD_PRELOAD

# Link against tcmalloc at runtime
LD_PRELOAD=/usr/lib/libtcmalloc.so.4 ./my_program

mimalloc – Microsoft’s New Contender

mimalloc is a relatively new design from Microsoft Research that emphasizes free list sharding and bias-free allocation. It avoids global locks entirely by using per-thread heaps, but instead of per-thread caches (which can cause memory bloat), mimalloc uses a unique approach: it maintains a global free list of pages, and each thread has a small “local” free list that it uses to satisfy allocations quickly. The global free list is designed to be mostly lock-free, using atomic operations.

mimalloc also implements eager coalescing and array of free lists to reduce fragmentation. It is known for its speed in both single-threaded and multi-threaded scenarios, often outperforming jemalloc and tcmalloc in microbenchmarks.

Benchmark example (from mimalloc readme):

Test 1: Allocate and free 64 byte objects in 32 threads
- mimalloc:  0.018 sec
- tcmalloc:  0.026 sec
- jemalloc:  0.028 sec
- ptmalloc:  0.132 sec

NUMA Awareness: The Next Frontier

As systems move to Non-Uniform Memory Access (NUMA) architectures, memory allocation becomes even more complex. In a NUMA machine (e.g., a dual-socket server), each processor has its own memory controller. Accessing memory from the local socket is fast (low latency), while accessing memory from the remote socket is slower (higher latency). A naive allocator may allocate memory for a thread from a remote node, hurting performance.

Modern allocators like jemalloc and tcmalloc incorporate NUMA awareness. For example, jemalloc can be configured to create arenas that are bound to specific NUMA nodes, and threads are assigned to arenas based on their current CPU. This ensures that allocations are local. The Linux kernel also provides the mbind() system call to set memory policy, but allocators can do a better job at fine-grained management.

How to check NUMA allocation in your application:

numactl --membind=0 ./my_app   # bind memory to node 0
perf stat -e remote_accesses    # count remote memory accesses

Real-World Case Studies

Web Servers: The LinkedIn Story

LinkedIn engineers reported that switching from ptmalloc to jemalloc reduced tail latency in their web servers by 40%. The reason was that jemalloc drastically reduced lock contention during high concurrency. A simple change like adding LD_PRELOAD to the start script yielded significant performance gains.

Databases: MySQL with InnoDB

MySQL’s InnoDB storage engine is known to be a heavy user of dynamic memory for buffer pools, transaction logs, and internal data structures. Using tcmalloc (or jemalloc) can improve throughput by 15-30% in multi-threaded benchmarks. The Percona server distribution bundles jemalloc by default. Without a scalable allocator, InnoDB can suffer from severe mutex contention on the memory allocator.

Game Engines: The Curse of Dynamic Allocation

Game engines often avoid dynamic allocation in the hot path, but some code paths (e.g., asset loading, particle systems) still rely on malloc. A stuttering frame rate can often be traced to allocator contention. Many game engines use custom allocators (stack allocators, pool allocators) for this reason. However, for those that use general-purpose allocators, choosing mimalloc has been shown to smooth frame times.

How to Profile and Tune Your Allocator

If you suspect the memory allocator is a bottleneck, you can verify it with profiling tools.

Using perf (Linux): Look for contention on functions like malloc, free, __libc_lock_lock. High cache-miss rates or spinlock time can indicate allocator problems.

perf record -e cycles:u -e cache-misses:u -e context-switches -g ./my_app
perf report

Using heaptrack or valgrind: These can show allocation hotspots, but may slow down execution.

Replacing the allocator: The easiest way to test is to preload a different allocator:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./my_app

Compare run times and memory usage.

Tuning parameters:

  • For jemalloc: set environment variable MALLOC_CONF with options like narenas:64, tcache:true, background_thread:true.
  • For tcmalloc: set TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to control cache size.
  • For mimalloc: use the MI_PADDING option to add guard pages for debugging.

The Future: Hardware-Assisted Allocation

The war for nanoseconds is pushing innovation into hardware. Intel’s “Memory Protection Extensions” (MPK) and ARM’s “Memory Tagging Extension” (MTE) could change allocation strategies. For instance, with MPK, an allocator could use protection keys to isolate different heaps without TLB flushes. Also, persistent memory (Optane) introduces new trade-offs: allocation must be aware of wear leveling and asymmetric read/write speeds.

Emerging designs like lock-free allocators based on hazard pointers or epoch-based reclamation are still mostly in research. Practical allocators for the next decade will likely blend thread-local caches with hardware transactional memory for lock-free operations.

Conclusion: Choose Your Weapon Wisely

The memory allocator is not an invisible layer you can ignore. It is a critical piece of infrastructure that can make or break the performance of your concurrent application. The humble malloc has become the battlefield where the trends of multi-core, NUMA, and cache coherence converge. By understanding the internal struggles of ptmalloc, and by leveraging modern alternatives like jemalloc, tcmalloc, or mimalloc, you can reclaim those lost nanoseconds and restore the true performance of your multi-core hardware.

Don’t let the data center hum deceive you. The war is real, and the ammunition is awareness. Profile your application, experiment with different allocators, and tune the configuration. Your users—and your CPUs—will thank you.