A Deep Dive Into The Go’S Scheduler: M:N Threading, Work Stealing, And System Calls

2024-10-03 · Leonardo Benicio

Summary

A comprehensive technical exploration of a deep dive into the go’s scheduler: m:n threading, work stealing, and system calls, covering key concepts, practical implementations, and real-world applications.

Contents

The Secret Weapon Behind Go’s Performance: A Deep Dive Into M:N Scheduling, Work Stealing, and the Runtime

It begins with a single keyword: go.

For the uninitiated, it looks like a formality—a syntactic sugar that spawns a lightweight thread. For the seasoned developer, however, that one word represents a fundamental shift in how we think about concurrency. It is the gateway to a system so elegantly engineered that it allows a single process to manage hundreds of thousands of concurrent tasks without breaking a sweat, while the equivalent program in C++ or Java would buckle under the weight of kernel thread overhead.

Go is fast. But its speed is not just a matter of compiled binaries or efficient garbage collection. The real magic, the true architectural marvel, lies in the Go Scheduler.

If you have ever wondered why your Go web server can handle 10,000 concurrent connections without a massive thread pool, or why goroutines feel so much “cheaper” than OS threads, the answer lies in the runtime’s ability to act as a hyper-efficient middleman between your code and the operating system. This blog post is a deep dive into that middleman. We are going to strip away the abstraction and look at the raw mechanics of the Go scheduler: the M:N threading model, the work-stealing algorithm, and the critical handling of system calls.

But before we get into the weeds of runqueues and processor slots, we need to understand why this matters so much today.

The Concurrency Crisis and the OS Thread Tax

To appreciate Go’s design, we have to revisit the recent history of computing. For decades, the dominant model for server-side programming followed a simple script: one client, one thread.

This worked reasonably well until it didn’t. The rise of multi-core processors and the shift from CPU-bound to I/O-bound workloads (think web servers, APIs, and databases) exposed a brutal truth: OS threads are expensive.

The Hidden Cost of a Kernel Thread

An operating system thread (a kernel-level thread) is a heavyweight structure. In Linux, creating a thread via pthread_create involves a system call into the kernel, allocation of a task structure (a few kilobytes), a separate kernel stack (typically 4KB to 16KB), and an initial user-space stack (often 8MB or more). The thread must be registered with the scheduler, and the kernel must maintain state for every thread—even those that are idle. Context switching between kernel threads requires a full mode switch: save all CPU registers, flush TLBs, update the page table, and load the state of the next thread. On modern CPUs, a context switch can take anywhere from a few hundred nanoseconds to several microseconds, depending on cache state.

But the real killer isn’t the switch itself—it’s the memory. With a default stack of 8 MB, 10,000 threads would consume 80 GB of virtual address space just for stacks. In practice, that’ll blow out any reasonable memory limit. Even if you use smaller stacks (e.g., via pthread_attr_setstacksize), the kernel still allocates a full page table overhead for each thread. Moreover, context switching becomes pathological when the number of runnable threads exceeds the number of CPU cores—the scheduler thrashes, spending more time switching than doing actual work.

The 1:1 Thread Model Bottleneck

Most languages that use native OS threads (C, C++, Java before virtual threads, Python’s threading module) rely on a 1:1 model—one user-facing thread corresponds to one kernel thread. This model is simple and predictable, but it imposes a hard ceiling on concurrency. A typical production Java web server might configure a thread pool of 200–400 threads. Beyond that, the overhead of thread creation and context switching starts hurting latency and throughput.

To work around this, developers adopted event-driven patterns: epoll in C, select in Java NIO, and async/await in C# and Rust. These patterns allow a single OS thread to multiplex many I/O operations, but they force the programmer to write in a continuation-passing style or use complex state machines. Go’s solution is different: it keeps the simple synchronous programming model (no async/await keywords) while letting the runtime handle the multiplexing internally.

The Rise of I/O-Bound Workloads

Modern server applications are dominantly I/O-bound: waiting for network packets, database queries, file reads, or upstream API responses. During those waits, an OS thread sits idle, consuming a full stack and scheduler resources. The holy grail is to have many more logical tasks than CPU cores, and to switch between them only when they’re actively doing work.

This is exactly where goroutines shine. They are not OS threads—they are user-space threads, managed entirely by the Go runtime. Their creation cost is tiny (a few kilobytes for the goroutine stack, which starts at 2KB and grows as needed). Context switching between goroutines happens in user space without kernel involvement, costing on the order of tens of nanoseconds—two orders of magnitude cheaper than an OS thread switch.

The Go Solution: M:N Scheduling

Go’s secret weapon is its M:N scheduler, also called a two-level scheduler. The “M” stands for Machine (OS threads), and “N” stands for number of goroutines. The runtime multiplexes N goroutines onto M OS threads, where M is typically set to the number of logical CPU cores (controlled by GOMAXPROCS).

Let’s introduce the key actors in the runtime’s scheduling drama:

G (Goroutine) – A lightweight user-space thread. Each G contains its own stack, instruction pointer, and other context (e.g., registers when it’s not running). It is the unit of concurrent execution that the programmer creates with go func().
M (Machine) – An OS thread. The M executes Goroutines. It is the heavy lifter that actually runs code on a CPU core. The number of M’s can be larger than GOMAXPROCS, especially if some M’s are blocked in system calls.
P (Processor) – The abstraction that bridges G and M. A P represents a scheduling context. The number of P’s is fixed equal to GOMAXPROCS (typically the number of CPUs). A P holds a local run queue of Goroutines that are ready to execute. An M must be associated with a P to run G’s. Think of a P as a “logical CPU” that the scheduler manages.

How the Trio Works Together

When a Go program starts, the runtime creates one M (the main OS thread) and one P. As you spawn goroutines, they are placed into the run queue of a P. The M picks a G from its associated P’s local queue and executes it. If the G blocks (e.g., on a channel operation, a mutex, or a system call), the runtime handles it without burning an OS thread (more on that later). If a P’s local queue empties, the runtime tries to steal work from other P’s. If all P’s are idle and there’s no work, the M’s go to sleep (spinning is handled as an optimization).

This design decouples the number of goroutines from the number of OS threads. You can have hundreds of thousands of G’s, but only a handful of M’s (usually equal to GOMAXPROCS) ever actively run. The rest are parked, waiting on channels, timers, or I/O completions.

The Cost of Goroutines vs OS Threads

Aspect	OS Thread (Linux)	Goroutine
Starting stack size	8 MB (default)	2 KB (initial)
Creation time	~10 microseconds	~100 nanoseconds
Context switch	~1–5 microseconds	~20–50 nanoseconds
Maximum per process	thousands	millions

Goroutines are so cheap that you can safely create one for every incoming HTTP connection, every database query, every sub-task of a parallel algorithm. The runtime handles multiplexing them onto the real OS threads efficiently.

The Scheduler’s Core: Work Stealing

The heart of the Go scheduler is a work-stealing algorithm. It ensures that all CPU cores are kept busy with minimal contention and load imbalance. Let’s understand why this is necessary and how it works.

Load Imbalance: The Natural Enemy of Parallelism

Imagine three P’s (P1, P2, P3) each with a local run queue. Suppose your application creates 10 goroutines and they all land on P1’s queue because of the way they were spawned. The other two P’s are idle, while P1 is overloaded. Without work stealing, those idle cores waste cycles.

Work stealing solves this: when a P finds its local queue empty, it picks a victim P (usually chosen randomly) and attempts to “steal” half the goroutines from the back of the victim’s local queue. This dynamically rebalances the load.

The Data Structures

The Go runtime maintains two levels of run queues:

Local run queue per P – a lock-free, bounded queue (size 256 by default). It’s fast because it avoids global locks most of the time.
Global run queue – a single, lock-protected queue used for a few special cases (e.g., goroutines created by non-blocking syscalls, goroutines that were stolen from, or when a P tries to steal from all others and fails). The global queue is drained slowly to ensure fairness.

Additionally, there’s the runnext field on each P: a single slot that holds the next goroutine to be run. This is used for scheduling priority: when a goroutine spawns a new goroutine, the child is placed in runnext, preempting the parent. This gives child goroutines a chance to run quickly, improving latency.

The Stealing Algorithm (Simplified)

Every time a P finishes executing a goroutine (or when the goroutine blocks), the scheduler triggers a schedule call. The sequence is roughly:

Check the P’s runnext slot. If set, run that goroutine next.
If empty, check the local run queue (a lock-free queue). Dequeue the next goroutine.
If local queue empty, try to steal from another P:
- Iterate over all other P’s in random order.
- For each victim, try to lock the victim’s local queue (non-blocking CAS).
- If successful, steal about half of its goroutines (the back half) and transfer them to my local queue. Then dequeue one for myself.
- If no stealing succeeds, check the global run queue.
If global queue is empty, the M (thread) may spin for a while before parking. Spinning is an optimization to avoid the overhead of putting the thread to sleep and waking it again when new work arrives.

Pseudo-Code for Work Stealing

func (p *P) schedule() G {
    // 1. Try runnext
    if g := p.runnext; g != nil {
        p.runnext = nil
        return g
    }
    // 2. Try local queue
    if g := p.localQueue.dequeue(); g != nil {
        return g
    }
    // 3. Try steal from others
    for each other P (random order) {
        if g := stealFrom(otherP); g != nil {
            return g
        }
    }
    // 4. Check global queue
    if g := globalQueue.dequeue(); g != nil {
        return g
    }
    // 5. Spin or sleep
    if spinCount < maxSpin {
        spinCount++
        // execute a short busy loop checking for work
        goto tryStealAgain
    }
    // park M and wait for work (e.g., via futex)
    park()
    return waitForWork()
}

This is oversimplified but captures the essence. The random victim selection reduces contention and ensures good load balancing in practice.

Why Work Stealing? Comparison to Other Strategies

Centralized queue (e.g., old Java thread pools): all workers pull from one shared queue. Contention becomes a bottleneck as cores increase.
Work stealing (Cilk, Go, Java ForkJoinPool): each worker has its own deque. Workers steal only when idle. Typically leads to better cache behavior and less contention.

Go’s work stealing is almost identical to the algorithm used in the Cilk parallel programming model.

Blocking and System Calls: The Runtime’s Magic

The most impressive part of the Go scheduler is how it handles blocking operations. If a goroutine blocks on something—a channel send/receive, a mutex, a system call (like read())—the runtime must not block the underlying OS thread. Otherwise, concurrency would be severely limited.

Blocking on Channels and Mutexes

When a goroutine blocks on a channel or a sync.Mutex, the runtime can simply park the goroutine and re-schedule a different one on the same M. This is cheap: the G is moved to a waiting queue (a sudog structure), and the scheduler picks the next runnable G from the local queue.

No OS thread is wasted. The runtime takes advantage of the fact that goroutines are user-space threads that can yield cooperatively.

Blocking System Calls: The Netpoller and Sysmon

The real challenge is blocking system calls like read() on a socket. If a goroutine calls syscall.Read(fd, buf), the kernel may block the entire OS thread until data arrives. This would idle one of your precious GOMAXPROCS threads.

Go’s solution is a thread per blocking call, but with a twist:

Before making a blocking system call, the runtime drops the P that is currently associated with the M. The P becomes idle and can be picked up by another M that is waiting for work.
The M (the OS thread) proceeds to make the blocking call. It is now blocked in the kernel, but that’s okay—it has no P, so it’s not wasting a logical CPU.
When the syscall returns, the M tries to reacquire a P. If none are available, the goroutine is put into the global run queue, and the M itself may be parked (or reused for another syscall).

This mechanism ensures that the number of M’s (OS threads) can temporarily exceed GOMAXPROCS when many goroutines are blocked in syscalls. But it also means the system could spawn many M’s if every goroutine hits a syscall simultaneously. To prevent unbounded M growth, the runtime uses a dedicated monitoring thread called sysmon (system monitor).

Sysmon: The Watchdog

Sysmon is a special M that runs periodically (every 10ms or so) and performs diagnostic and scheduling tasks:

Detects long-running goroutines (via preemption logic) and forces them to yield.
Checks if any M is stuck in a syscall for too long. If a syscall M hasn’t returned for a 20µs (by default), sysmon marks that M as “retired” and forces the goroutine to be rescheduled.
Handles timers, GC coordination, and netpoller polling.

The Network Poller (netpoller)

Instead of letting a goroutine block on a read() on a TCP socket, Go’s standard library uses the runtime’s netpoller. The netpoller is an integration with the OS’s I/O multiplexing mechanism (epoll on Linux, kqueue on macOS, IOCP on Windows).

When you call conn.Read(), the underlying goroutine doesn’t actually call read() directly. Instead, it registers the file descriptor with the netpoller and parks the goroutine. The netpoller is driven by sysmon or by dedicated netpoller goroutines that call epoll_wait(). When data arrives, the netpoller wakes the corresponding goroutine and moves it back to a run queue.

This achieves the holy grail: non-blocking I/O without async/await. The programmer writes synchronous-looking code; the runtime takes care of multiplexing behind the scenes.

Example: A Simple TCP Server

func handle(conn net.Conn) {
    buf := make([]byte, 1024)
    n, _ := conn.Read(buf) // <-- This goroutine parks here if no data
    conn.Write(buf[:n])    // <-- This can also park if buffer is full
}

func main() {
    listener, _ := net.Listen("tcp", ":8080")
    for {
        conn, _ := listener.Accept()
        go handle(conn) // One goroutine per connection – works great
    }
}

Under the hood, the conn.Read call suspends the goroutine. The netpoller waits for the socket to become readable, then resumes the goroutine. Meanwhile, other goroutines (from other connections) keep running on the same OS threads.

The Scheduler Cycle: Preemption, Spinning, and Fairness

In early versions of Go (pre-1.14), the scheduler was cooperatively scheduled: a goroutine would yield control only at specific points (function calls, channel operations). A tight loop without function calls could lock up the entire P and starve other goroutines.

The Preemption Problem

Consider code like:

for i := 0; i < 1e12; i++ {
    // busy work
}

Without preemption, this code would run forever on the same OS thread, denying other goroutines a chance to run. In Go 1.14, the runtime introduced non-cooperative preemption. This was a major milestone.

How it works:

The sysmon thread periodically sends a preemption signal (SIGURG on Linux) to the running M.
The signal handler checks if the current goroutine should be preempted.
If yes, the handler saves the goroutine’s register state and replaces the program counter with the scheduler’s schedule function address. When the signal returns, execution jumps into the scheduler, which yields the goroutine and picks the next one.
This is safe because the goroutine’s stack scan is already GC-safe; the signal is handled at safe points (interruptible instructions).

Preemption ensures fairness: no single goroutine can monopolize a P indefinitely. It also makes the runtime’s scheduling more deterministic for the purposes of garbage collection (stop-the-world phases).

Spinning Threads

To minimize latency when new work appears, the runtime uses spinning M’s. After a P fails to steal work and finds the global queue empty, it may keep the M busy-spinning for a short duration (on the order of tens of microseconds) before putting the M to sleep. During spinning, the M repeatedly tries to steal or check global queue. This is a trade-off: it burns CPU but reduces the wake-up latency (which would require a system call and thread scheduling). The runtime limits the number of spinning M’s to the number of P’s (or less) to avoid hogging CPU.

Fairness of the Global Queue

The global run queue is serviced at a throttle rate: every 61st schedule, the scheduler checks the global queue, even if local queues are non-empty. This ensures that goroutines that were placed in the global queue (e.g., after being unparked from a netpoller event) eventually get a chance to run.

Practical Implications: Tuning GOMAXPROCS, Goroutine Pooling, Latency vs Throughput

Understanding the scheduler helps you write better Go code and tune your applications.

Setting GOMAXPROCS

GOMAXPROCS controls the number of P’s. By default, it equals the number of logical CPUs. For CPU-bound workloads, more P’s than CPUs can lead to oversubscription and increased context switching. For I/O-bound workloads, you might actually benefit from setting GOMAXPROCS higher than CPU count (if you have many goroutines blocked on syscalls, the extra P’s are not used by those M’s, but it can help with load balancing). However, the runtime already spawns extra M’s for blocking syscalls, so the default is usually optimal.

Goroutine Pooling: Necessary?

Because goroutines are so cheap, many Go programmers advocate for creating a goroutine per task. This is fine for most applications. However, in extreme high-throughput scenarios (e.g., handling 100k+ requests per second with very small response sizes), the overhead of goroutine creation and GC pressure from stack pages may become visible. In such cases, you can use a worker pool pattern (similar to a goroutine pool) where a fixed number of goroutines read from a channel.

const numWorkers = 100
jobs := make(chan Job, 1000)
for i := 0; i < numWorkers; i++ {
    go func() {
        for job := range jobs {
            process(job)
        }
    }()
}

This limits goroutine count and can improve cache locality.

Latency vs Throughput Trade-offs

Work stealing can introduce latency variance because a goroutine may be stolen from one P to another, causing its data to be cold in caches (cache affinity disruption). The runtime does not attempt to keep goroutines on the same P (no CPU affinity). For CPU-intensive pipelines, this can hurt performance. Workarounds include batching or using P-local pools (like sync.Pool is P-aware).

Comparisons with Other Languages

Java: Threads and Virtual Threads

Traditional Java threads are OS threads (1:1). Java 21 introduced Virtual Threads (Project Loom), which are similar to goroutines—user-mode threads scheduled onto carrier threads. Virtual threads are also M:N scheduled, but the implementation uses a ForkJoinPool with work stealing. However, virtual threads still have slightly higher overhead than goroutines because they are built on top of the JVM and require continuation objects. The key difference is that Go’s scheduler is tightly integrated with the language runtime; in Java, virtual threads must cooperate with the OS via thread-per-carrier blocking, though they also handle pinning (when a virtual thread blocks in synchronized code or native methods, it can pin the carrier).

Rust: Async/Await and Tokio

Rust’s async/await model is cooperative and zero-cost, but it requires explicit async markers and .await calls. Under the hood, a runtime like Tokio uses a work-stealing scheduler similar to Go’s. However, Rust’s model is more akin to C++ coroutines—all tasks are visible at compile time, and stacks are state machines. Go is more dynamic: any function call can suspend without the programmer knowing (e.g., I/O in the standard library). This makes Go easier for beginners but less predictable for critical latency scenarios.

Erlang/Elixir: The BEAM

Erlang’s BEAM VM also uses lightweight processes (actors) and a preemptive scheduler. Its model is similar to Go’s, but its scheduling is preemptive at a higher level (reductions per process). Go’s preemption is more fine-grained (non-cooperative via signals). Both achieve massive concurrency, but Erlang’s model is more focused on fault tolerance and hot code swapping, while Go’s scheduler is designed for low-latency I/O.

Advanced Topics: Lock-Free Queues, Memory Model, and GC Interaction

The run queue implementation is a lock-free MPMC queue (multi-producer, multi-consumer). It uses a ring buffer and atomic operations (CAS) to manage enqueue/dequeue. This avoids the overhead of mutexes in the hot path.

The Memory Model and the Happens-Before Guarantee

Go’s memory model ensures that operations are sequenced appropriately across goroutines via channels, mutexes, and atomic packages. The scheduler itself respects these ordering semantics.

Interaction with Garbage Collection

The Go garbage collector is concurrent and generational. When GC starts, all P’s are stopped briefly (the “stop the world” phase), but the scheduler plays a key role: each M at a GC safe point yields and participates in stack scanning. The scheduler must ensure that goroutines are not running during certain phases of GC. Preemption ensures that goroutines are stopped quickly. The GC also influences scheduling: during GC, the runtime may create more M’s to help with concurrent marking.

Conclusion: The Elegance of the Go Runtime

The Go scheduler is a masterpiece of systems engineering. It takes a simple language keyword—go—and weaves it into a sophisticated mechanism that enables developers to write concurrent code without the cognitive overhead of callbacks, futures, or async/await keywords. Under the hood, the M:N model, work stealing, netpoller, and non-cooperative preemption combine to deliver near-perfect CPU utilization and massive I/O concurrency.

The next time you run your Go web server or parallel algorithm, remember that the runtime is not just running your code—it’s orchestrating a delicate dance between goroutines, threads, and processors. It’s a dance that makes Go one of the most productive languages for building scalable, high-performance systems.

This deep dive only scratched the surface. In a follow-up post, we’ll explore how the scheduler handles timers, how to profile and visualize scheduling with trace tools, and how to write custom scheduling policies using runtime hooks. Stay tuned.