The Implementation Of A User Level Threading Library For Cooperative Multitasking In C

2020-11-18 · Leonardo Benicio

Summary

A comprehensive technical exploration of the implementation of a user level threading library for cooperative multitasking in c, covering key concepts, practical implementations, and real-world applications.

Contents

The Illusion of Many: Building a User-Level Threading Library from Scratch

Introduction: The Juggler’s Secret

Imagine a street performer juggling three balls. A child, wide-eyed, asks, “How do you keep them all in the air at once?” The honest answer is, you don’t. You throw the first ball, then the second, then the third, and you catch each one in rapid succession. The illusion of simultaneity comes from exquisite timing—a sequence of fast, coordinated hand-offs. That is the heart of cooperative multitasking, and it is the foundation of some of the most impressive systems in computing: early operating systems, event loops in Nginx, Node.js, and even the fibers that power high-frequency trading platforms.

We live in an age of overwhelming parallelism. Your laptop’s CPU may boast eight, sixteen, or even more physical cores. The operating system preemptively slices time into quantum-thin shards, doling them out to processes and kernel-level threads with ruthless efficiency. To the average programmer, this is magic—a black box that handles concurrency. You call pthread_create(), and threads just… happen. But this magic comes at a cost. Every time a thread yields control, blocks on I/O, or gets preempted, a system call is made. The CPU traps into the kernel, saves the hardware thread state, consults a scheduler, switches context, and returns to user space. This overhead, while small for a single call, becomes a significant bottleneck when managing thousands or tens of thousands of concurrent tasks. Think of a high-frequency trading engine, a real-time game server, or an embedded sensor network running on a tiny microcontroller with no operating system at all.

User-level threading—often called “fibers,” “green threads,” or “coroutines”—regains relevance in such scenarios. The central insight is simple: if you write a program that knows exactly when it can pause and resume its own tasks, you can perform context switches entirely in user space, avoiding kernel entry and exit. This can be ten to a hundred times faster than kernel-level thread switches. Furthermore, you gain fine-grained control over scheduling policies, memory allocation, and stack management.

In this post, we’ll peel back the layers of abstraction and build a minimal user-level threading library from scratch in C. You’ll learn how context switching really works at the register and stack level, how to design a scheduler, and how to implement cooperative yielding and synchronization primitives. By the end, you’ll understand why user-level threading remains a powerful tool in a system programmer’s belt, and you’ll have the knowledge to craft your own custom concurrency mechanisms.

Part 1: The Cost of the Kernel

1.1 The Hidden Toll of `pthread_create`

Kernel threads are heavyweight. When you create a thread with pthread_create, the operating system allocates a new task structure, a kernel stack, and a user stack. It registers the thread with the scheduler, sets up signals, and may pre-allocate memory for thread-local storage. The Linux clone() system call, which underlies pthread_create, involves over a dozen kernel functions, memory management, and security checks. Even a simple pthread_yield()—which voluntarily gives up the CPU—enters the kernel, saves the calling thread’s context, and runs the scheduler to select the next thread. On a modern x8664 Linux system, a kernel context switch might take anywhere from 50 to 200 nanoseconds, but that’s the _bare switch time. When you add cache misses (the kernel runs on different code paths and data structures than user space), the effective cost can balloon to several microseconds.

Now imagine you have 10,000 concurrent connections on a web server, each handling a small bit of logic. If each connection is a kernel thread, you’re burning resources on thread creation, context switching, and memory overhead. You may run into the ulimit on threads, or cause the kernel’s scheduler to thrash—spending more time choosing which thread to run than actually executing user code.

1.2 The Case for User Space

User-level threads bypass the kernel entirely. They run on a single kernel thread (or a handful, for parallelism). The scheduler is a simple loop in a library. Yielding control is just a function call that saves a few registers and jumps to another task’s saved context. No traps, no kernel involvement, no permission checks. The performance difference is dramatic: a user-level context switch can be as fast as a dozen assembly instructions—on the order of 5–20 nanoseconds. Moreover, you can allocate exactly the stack memory you need (often 4 KB per fiber, versus 1 MB or more for a kernel thread default), allowing you to support millions of concurrent tasks within a single process.

But there’s a catch: user-level threads are cooperative. They must voluntarily yield control. If one thread blocks on I/O or enters an infinite loop, all threads on that kernel thread stall. That’s why modern green-thread implementations (like Go’s goroutines or Rust’s Tokio) often combine user-level scheduling with asynchronous I/O and work-stealing thread pools to avoid blocking the entire execution context.

Part 2: Context Switching – The Core Mechanism

2.1 What Is a Thread, Really?

From the CPU’s perspective, a thread is a set of register values and a stack. The registers include the instruction pointer (RIP on x86_64, PC on ARM), the stack pointer (RSP), general-purpose registers (RAX, RBX, RCX, RDX, RSI, RDI, R8–R15 on x86_64), and the flags register (RFLAGS). Additionally, floating-point and SIMD state (XMM registers) and signal masks may be part of the context, but for minimal fibers we can ignore them or save them on demand.

To switch from thread A to thread B, we need to:

Save A’s registers (except those that the callee-saves convention preserves) to a memory buffer.
Load B’s saved registers from its buffer.
Jump to B’s saved instruction pointer.

The complication is that the stack is also part of the thread’s state. When A is running, its local variables live on its stack. When we switch, we must also switch the stack pointer. So saving and restoring the stack pointer effectively changes the active stack.

2.2 Anatomy of a Context Switch in Assembly

Let’s implement a minimal context switch for x86_64 using setjmp/longjmp or raw assembly. For educational purposes, we’ll write our own. The classic approach is to use the ucontext family (makecontext, swapcontext) provided by POSIX, but those require linking with a library (sometimes -lucontext) and have overhead. We’ll build a leaner version.

We need a structure to hold the saved registers:

typedef struct {
    long r15, r14, r13, r12, rbx, rbp;
    long rip, rsp;
    // We could also save rdi, rsi, rcx, rdx, etc., but caller-saved registers
    // are already saved by the calling function if needed.
} ctx_t;

We’ll follow the System V AMD64 ABI, where registers rbx, rbp, r12–r15 are callee-saved. The instruction pointer (RIP) and stack pointer (RSP) are implicit in call/ret and the call stack. To do a manual switch, we need to save and restore those callee-saved registers plus the stack pointer. The instruction pointer is saved as the return address when we call the switch function.

The assembly routine for switch_to(ctx_t *next) could look like:

.global switch_to
switch_to:
    # Save current context (callee-saved regs, stack pointer, and return address)
    # The current stack pointer is in RSP.
    # We need to put it into the struct pointed to by the first argument.
    # The first argument is the current context pointer.
    # We'll assume a simple scheme: the caller passes two pointers:
    # switch_to(current, next)
    # where current is a pointer to store the current context,
    # and next is a pointer to load the next context.
    # This is similar to swapcontext.

However, writing cross-platform assembly is tedious. Let’s simplify by using the POSIX ucontext API for prototyping, then show the manual approach for the final library. The ucontext functions handle saving/restoring the full register set including floating point, and are portable across POSIX systems. But we’ll later strip that down for performance.

Code example: Using ucontext

#include <ucontext.h>
#include <stdio.h>
#include <stdlib.h>

#define STACK_SIZE 32768

ucontext_t main_ctx, fiber_ctx;

void fiber_function() {
    printf("Hello from fiber!\n");
    setcontext(&main_ctx); // yield back to main
}

int main() {
    char *stack = malloc(STACK_SIZE);
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = stack;
    fiber_ctx.uc_stack.ss_size = STACK_SIZE;
    fiber_ctx.uc_link = &main_ctx; // if fiber returns, go to main
    makecontext(&fiber_ctx, fiber_function, 0);

    printf("Main before switch\n");
    swapcontext(&main_ctx, &fiber_ctx);
    printf("Main after switch\n");
    free(stack);
    return 0;
}

This demonstrates the essence: swapcontext saves the current context (including the stack pointer, instruction pointer, and callee-saved registers) into the first argument, and loads the second argument, resuming execution there. The makecontext sets up a new context pointing to a function with a given stack.

2.3 Our Own Context Switch – Minimal Assembly

To avoid the overhead of ucontext (which saves a lot of state we might not need), let’s write a custom ctx_switch in assembly. We’ll define a struct ctx that holds only the bare minimum: callee-saved registers and the stack pointer. The instruction pointer is implicitly stored as the return address.

We’ll design the switch function to store the current context and load the next:

// Save current registers into *old, then load *new.
void ctx_switch(ctx_t *old, ctx_t *new);

Assembly (x86_64, AT&T syntax):

.global ctx_switch
ctx_switch:
    # Save callee-saved registers and stack pointer into old.
    # old is in %rdi (first argument), new is in %rsi (second argument).
    movq %r15, (%rdi)
    movq %r14, 8(%rdi)
    movq %r13, 16(%rdi)
    movq %r12, 24(%rdi)
    movq %rbx, 32(%rdi)
    movq %rbp, 40(%rdi)
    movq %rsp, 48(%rdi)
    # Save the return address (IP)
    movq (%rsp), %rax
    movq %rax, 56(%rdi)
    # Load new context
    movq (%rsi), %r15
    movq 8(%rsi), %r14
    movq 16(%rsi), %r13
    movq 24(%rsi), %r12
    movq 32(%rsi), %rbx
    movq 40(%rsi), %rbp
    movq 48(%rsi), %rsp
    # Jump to the saved IP.
    # We need to push the IP onto the new stack so that ret will go there.
    # Alternatively, we can use a jmp to the address. We'll push it.
    pushq 56(%rsi)
    ret

But careful: When we save the current context, the stack pointer at the time of the call is pointing to the return address. So movq (%rsp), %rax captures the return address. When loading, we push that address onto the new stack and then ret to it. This effectively switches execution.

One nuance: The stack we are loading must already have the return address of the target function. For the initial thread creation, we need to set up a stack that contains a return address pointing to a function that will eventually yield back. Setting up a new fiber context requires manually crafting the stack: allocating a block of memory, setting RSP to the top of that block minus space for a return address, and placing the address of the fiber entry function as the return address. Then we also need to set up the initial callee-saved registers to zero (except RBP can be zero as a frame pointer sentinel). This is a few lines of C code plus inline assembly.

C helper to create a context:

void ctx_init(ctx_t *ctx, void (*func)(void), void *stack, size_t stack_size) {
    // Stack grows downward on x86_64.
    // We'll treat the top of stack as the initial RSP.
    uintptr_t *sp = (uintptr_t *)((char *)stack + stack_size);
    // Make room for return address (and maybe a sentinel).
    *--sp = (uintptr_t)func; // This will be the first "return address" when we switch.
    // Set the saved stack pointer to this location.
    ctx->rsp = (long)sp;
    // Other registers can be zero; RIP is implicit from the return address.
    ctx->rbp = 0;
    ctx->rbx = 0;
    // ...
}

Then to switch to the fiber, we call ctx_switch(&current, &fiber_ctx). The fiber runs, and when it wants to yield, it calls ctx_switch(&fiber_ctx, &current) again. This is the yield mechanism.

Now we have the core primitive: a lightweight, user-space context switch that does not touch the kernel.

Part 3: Building a Scheduler

3.1 The Ready Queue

A scheduler manages a collection of fibers. The simplest scheduler is a round-robin queue: a linked list of fiber control blocks (FCBs). Each FCB contains:

The saved context (ctx_t)
A stack (allocated dynamically)
A status (READY, BLOCKED, FINISHED)
A link to the next FCB in the queue
Optional: fiber ID, name, priority

When a fiber yields (calls yield()), it performs a context switch back to the scheduler. The scheduler then picks the next ready fiber from the queue and switches to it. This is cooperative: fibers must call yield() to give others a chance.

Let’s design a simple API:

typedef void (*fiber_func_t)(void *arg);

void fiber_scheduler_init();
int  fiber_create(fiber_func_t func, void *arg, size_t stack_size);
void fiber_yield();
void fiber_scheduler_run(); // never returns

The scheduler itself runs on the main context (the kernel thread’s stack). We’ll store the scheduler’s context in a global variable scheduler_ctx. When fiber_scheduler_run() is called, it starts the first fiber.

3.2 Implementation Sketch

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define DEFAULT_STACK_SIZE 16384

typedef struct ctx { /* ... */ } ctx_t;

// Function declarations
extern void ctx_switch(ctx_t *old, ctx_t *new);
void ctx_init(ctx_t *ctx, void (*func)(void), void *stack, size_t size);

typedef enum { READY, BLOCKED, FINISHED } fiber_state;

typedef struct fiber {
    ctx_t ctx;
    char *stack;
    size_t stack_size;
    fiber_state state;
    struct fiber *next;
} fiber_t;

static fiber_t *ready_queue = NULL;
static fiber_t *current_fiber = NULL;
static ctx_t scheduler_ctx;
static int initialized = 0;

The fiber_create function allocates a fiber_t, allocates a stack, initializes the context to run a wrapper function that will call the user function, and adds the fiber to the ready queue.

The wrapper function (fiber_entry) will call the user function, then mark the fiber as finished and yield back to the scheduler. The scheduler will then remove it from the ready queue (or just skip it if it’s finished).

fiber_yield saves the current fiber’s context into current_fiber->ctx, then switches to the scheduler’s context:

void fiber_yield() {
    fiber_t *old = current_fiber;
    // Save current context into old->ctx, then load scheduler_ctx.
    ctx_switch(&old->ctx, &scheduler_ctx);
}

But wait—when the scheduler switches to a fiber, it must save its own context into scheduler_ctx, then load the fiber’s context. So the scheduler loop looks like:

void fiber_scheduler_run() {
    initialized = 1;
    while (1) {
        // Dequeue next ready fiber
        if (ready_queue == NULL) {
            if (all fibers finished) break;
            // else idle (spin or wait for blocked fibers--we'll skip for now)
        }
        fiber_t *fib = ready_queue;
        // remove from queue (simple list, dequeue front)
        ready_queue = fib->next;
        current_fiber = fib;
        // Switch to fiber: save scheduler_ctx, load fib->ctx
        ctx_switch(&scheduler_ctx, &fib->ctx);
        // After fiber yields or finishes, control returns here.
        // If fiber finished, free its stack and struct.
        if (fib->state == FINISHED) {
            free(fib->stack);
            free(fib);
        } else if (fib->state == READY) {
            // re-add to end of queue
            // ... add to tail
        }
        current_fiber = NULL;
    }
}

3.3 The Entry Wrapper and Stack Management

When we create a fiber, we need a function that calls the user’s function and then finishes. We can set up the stack so that the initial context’s RIP points to a small assembly or C function that calls the user function and then marks the fiber as finished. However, we need to be able to pass an argument. The System V ABI passes the first argument in RDI. So when we switch to the fiber, we can set RDI to the user argument before the switch. But we already loaded the context from the saved registers. A simpler approach: the initial context’s stack should contain a “bootstrap” function that will be called with the user function and argument.

We can write a small C function fiber_bootstrap(void (*func)(void*), void *arg) that calls func(arg) then calls fiber_exit(). We set the initial RIP to this bootstrap function, and we set up the initial stack such that the first “return” from ctx_switch will jump to the bootstrap. But we need to pass arguments. The bootstrap function receives its arguments in registers as per ABI (RDI, RSI). We can set up the initial context’s registers to have those values. So during ctx_init, we can set the saved RDI and RSI to the function and arg. However, our ctx_t does not include caller-saved registers. That’s fine because when we ctx_switch into the fiber, we are essentially “calling” the fiber entry point. The registers RDI, RSI, etc. are not preserved across the switch anyway. But we need to have them set before the ret instruction. We can push them onto the stack before the return address? That would make them arguments to the entry function? No, that would be interpreted as stack arguments.

A cleaner way: Modify the context switch to also set RDI and RSI from the new context’s saved frame, but then we need to save them in the ctx struct. For minimalism, we can avoid passing arguments and use a global or thread-local variable. But let’s be thorough.

We can extend ctx_t to include RDI and RSI (caller-saved) only for initial setup. In practice, we can treat the initial context switch as a jump to a function that expects its arguments already in registers. So we can set the initial stack to contain a dummy return address that points to a “fiber entry” assembly that sets RDI and RSI from a known memory location. But that’s messy.

Simpler solution: The bootstrap function is called with no arguments. Instead, we store the user function pointer and argument in the fiber control block, and the bootstrap reads those from a known global (or from the fiber struct using a thread-local pointer). Since we are single-threaded (one kernel thread), we can set a global current_fiber before switching, and the bootstrap can access it.

static fiber_t *current_fiber; // global set before switch

void fiber_bootstrap() {
    fiber_t *fib = current_fiber;
    fib->func(fib->arg);
    fiber_exit();
}

In fiber_scheduler_run, before switching to a fiber, we set current_fiber = fib. Then when the fiber runs, fiber_bootstrap reads the function and argument from the current fiber. This works because we have a single kernel thread.

3.4 A Complete Minimal Library (Core)

Now we have enough to implement the scheduler, create fibers, and yield. Let’s write the missing pieces.

Header file (simplified):

// fiber.h
#ifndef FIBER_H
#define FIBER_H

#include <stddef.h>

typedef void (*fiber_func_t)(void*);

void fiber_init();
int fiber_spawn(fiber_func_t func, void *arg, size_t stack_size);
void fiber_yield();
void fiber_exit();
void fiber_run(); // starts scheduling, never returns

#endif

Implementation (fiber.c):

#include "fiber.h"
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

// Assembly ctx_switch
extern void ctx_switch(void *old, void *new);

static void ctx_init(ctx_t *ctx, void *stack, size_t size) {
    uintptr_t *sp = (uintptr_t *)((char *)stack + size);
    // Push a return address pointing to fiber_bootstrap (defined in assembly or C)
    // We'll use a C function and need to align stack to 16 bytes? System V ABI requires RSP mod 16 == 0 before call.
    // We'll push a dummy return address first (0), then the bootstrap address.
    *--sp = 0; // dummy return address if bootstrap returns (should not happen)
    *--sp = (uintptr_t)fiber_bootstrap; // this will be the first "return address"
    ctx->rsp = (long)sp;
    // Clear other registers
    ctx->rbp = 0;
    ctx->rbx = 0;
    ctx->r12 = 0;
    ctx->r13 = 0;
    ctx->r14 = 0;
    ctx->r15 = 0;
    // Note: RIP is implicit from the stack return address.
}

typedef enum { READY, BLOCKED, FINISHED } fiber_state;

typedef struct fiber {
    ctx_t ctx;
    char *stack;
    size_t stack_size;
    fiber_state state;
    fiber_func_t func;
    void *arg;
    struct fiber *next;
} fiber_t;

static fiber_t *ready_head = NULL;
static fiber_t *ready_tail = NULL;
static fiber_t *current = NULL;
static ctx_t scheduler_ctx;
static int initialized = 0;

// Forward declaration
static void fiber_bootstrap();

void fiber_init() {
    if (!initialized) {
        initialized = 1;
        ready_head = ready_tail = NULL;
        current = NULL;
    }
}

int fiber_spawn(fiber_func_t func, void *arg, size_t stack_size) {
    if (!initialized) return -1;
    fiber_t *fib = malloc(sizeof(fiber_t));
    if (!fib) return -1;
    fib->stack = malloc(stack_size);
    if (!fib->stack) { free(fib); return -1; }
    fib->stack_size = stack_size;
    fib->func = func;
    fib->arg = arg;
    fib->state = READY;
    ctx_init(&fib->ctx, fib->stack, stack_size);
    // Enqueue
    fib->next = NULL;
    if (ready_tail) {
        ready_tail->next = fib;
        ready_tail = fib;
    } else {
        ready_head = ready_tail = fib;
    }
    return 0;
}

void fiber_yield() {
    if (current == NULL) return; // called from scheduler? shouldn't
    fiber_t *old = current;
    // Save current context into old->ctx, then load scheduler_ctx.
    ctx_switch(&old->ctx, &scheduler_ctx);
}

void fiber_exit() {
    if (current) {
        current->state = FINISHED;
        fiber_yield(); // yields to scheduler
    }
}

void fiber_run() {
    while (1) {
        // Dequeue a ready fiber
        fiber_t *fib = ready_head;
        if (!fib) break; // no more fibers
        ready_head = fib->next;
        if (ready_head == NULL) ready_tail = NULL;
        fib->next = NULL;
        current = fib;
        // Switch to fiber
        ctx_switch(&scheduler_ctx, &fib->ctx);
        // After yield or exit
        if (current->state == FINISHED) {
            free(current->stack);
            free(current);
        } else {
            // Re-enqueue if still ready (should not happen if we only yield, but for block/resume later)
            // For simplicity, we only re-enqueue if state is READY after yield.
            if (current->state == READY) {
                // add to tail
                current->next = NULL;
                if (ready_tail) {
                    ready_tail->next = current;
                    ready_tail = current;
                } else {
                    ready_head = ready_tail = current;
                }
            } else {
                // BLOCKED? We'll need a block queue. For now ignore.
            }
        }
        current = NULL;
    }
    // All fibers finished
    printf("Scheduler: all fibers done.\n");
}

Now we need to define ctx_switch and fiber_bootstrap. The bootstrap function should be written in assembly to avoid the C calling convention interfering with the raw context. But we can also write it as a C function that never returns normally, relying on the fact that it calls fiber_exit() which yields. However, the C function’s prologue and epilogue may corrupt registers. It’s safer to write it in assembly. Let’s write a minimal assembly file.

fibers.asm (NASM syntax or AT&T? I’ll use AT&T with GNU as):

.global ctx_switch
.type ctx_switch, @function
ctx_switch:
    # Save callee-saved registers to the old context structure (rdi)
    movq %r15, (%rdi)
    movq %r14, 8(%rdi)
    movq %r13, 16(%rdi)
    movq %r12, 24(%rdi)
    movq %rbx, 32(%rdi)
    movq %rbp, 40(%rdi)
    movq %rsp, 48(%rdi)
    # Save return address (the address we will return to)
    movq (%rsp), %rax
    movq %rax, 56(%rdi)
    # Load new context from rsi
    movq (%rsi), %r15
    movq 8(%rsi), %r14
    movq 16(%rsi), %r13
    movq 24(%rsi), %r12
    movq 32(%rsi), %rbx
    movq 40(%rsi), %rbp
    movq 48(%rsi), %rsp
    # Now we need to jump to the saved instruction pointer.
    # Push the IP onto the stack and then ret.
    pushq 56(%rsi)
    ret

.global fiber_bootstrap
.type fiber_bootstrap, @function
fiber_bootstrap:
    # At this point, we are running in a new fiber.
    # The global current points to the fiber_t.
    # We need to call current->func(current->arg)
    # Load current address from global variable.
    # But we cannot access C global directly in asm without extern. Simpler: use a wrapper.
    # Let’s call a C function that does the work.
    # However, we must be careful with stack alignment.
    # For simplicity, we'll call a C function fiber_bootstrap_c.
    movq current(%rip), %rdi   # first argument: fiber_t
    call fiber_bootstrap_c
    # Should not return, but if it does, exit.
    # We can call fiber_exit via plt? Better to just call exit.
    movq $0, %rdi
    call exit

But accessing C global current in assembly requires linking. We can write the bootstrap entirely in C by using a trampoline that sets up the stack correctly. Actually, we can implement fiber_bootstrap as a C function that does not expect a normal call stack. If we call it from assembly with a proper stack frame, it will work. The tricky part is that the initial stack we set up in ctx_init has a return address of fiber_bootstrap. When we ret from ctx_switch, we land at fiber_bootstrap. The stack pointer at that moment points to the dummy return address (0). So we have a valid stack (just with garbage). We can call a C function from there as long as alignment is correct. System V requires RSP to be 16-byte aligned before a call. Our stack setup: we pushed a 0 (8 bytes) then the bootstrap address (8 bytes). So RSP after the ret is below the pushed 0? Actually, after ret, RSP is incremented by 8, so it points to the next address (the location of the 0). So RSP is not aligned. We need to align it. Let’s adjust the stack initialization to ensure 16-byte alignment after the ret. For example, push an extra 8 bytes of padding. Or we can ignore alignment; many programs work without it but may cause issues with SSE instructions. For safety, we’ll align.

Let’s set up the initial stack as:

RSP initially at stack_top.
Push a dummy value (8 bytes) for alignment.
Push 0 (initial return address? Actually, we want after the bootstrap runs (if it ever returns) to go somewhere safe, maybe to fiber_exit). Let’s push a value that points to fiber_exit as a fallback.
Push the bootstrap address.

So the stack layout from high to low:

(high) stack_top
alignment padding (8 bytes)
fallback return address (pointer to fiber_exit)
bootstrap address (lowest)

When ctx_switch does ret, it pops the bootstrap address and jumps there, leaving RSP pointing to the fallback address. So the stack is now: fallback address at RSP, then padding above. The fallback address will be the return address for bootstrap if it ever does a ret. That’s fine; if bootstrap calls fiber_exit, it will never return. If it does return (bug), it will go to fiber_exit which yields.

Thus, after ret, RSP is aligned to 8 bytes but not necessarily 16. With padding, we can ensure alignment. Add an extra 8 bytes before the fallback address: push a dummy, then fallback, then bootstrap. After ret, RSP points to fallback. That’s 8 bytes below the dummy. If we want RSP mod 16 == 0 at the point of a call inside bootstrap, we need RSP before a call to be 16-byte aligned. Since we just did a ret, RSP = &fallback. If we call a function from bootstrap, we will push the return address (8 bytes), so RSP becomes &fallback - 8, which is not 16-byte aligned if &fallback was 8 mod 16. We need &fallback to be 0 mod 16. So we need the initial stack to be set up so that after the ret, RSP is 0 mod 16. With a dummy, we can achieve that by adjusting the base.

Given the complexity, many practical implementations simply accept misalignment or use compiler attributes to ignore SSE. For our blog, we can assume it’s okay.

Instead, we can simplify by not using a bootstrap assembly; we can make fiber_bootstrap a C function that is called directly from ctx_switch as a normal function. But then we need to set RSP to a valid stack before the call. Actually, in ctx_init, we have set RSP to the top of stack minus some offsets. The first instruction after ctx_switch will be at the bootstrap address, but the stack pointer will be exactly what we set in the saved RSP. So we can set RSP to a value that already has the bootstrap address pushed? Wait, we are mixing two models.

Let’s step back. The standard approach for user-level threading libraries (like GNU Pth, libfiber, etc.) is to use makecontext/swapcontext which abstract all this. But for our custom version, a cleaner method is to treat the context switch as a coroutine: the scheduler calls a switch function that swaps the stack and jumps to a saved instruction pointer. The instruction pointer is set to an assembly function that then calls the fiber’s entry point.

We can avoid the bootstrap assembly entirely by making the fiber entry point a C function that takes no arguments and never returns. In ctx_init, we set the initial IP (via the return address on stack) to a function fiber_entry that is written in C. That function will read the current fiber’s function and argument from the global current and call it. But we must ensure that when we switch to this fiber, the stack is valid and the registers are set appropriately. Since we control the initial stack in ctx_init, we can set the stack pointer to a location that contains a return address pointing to fiber_entry. When ctx_switch does ret, it goes to fiber_entry. At that point, the stack pointer is whatever we set in RSP prior to the switch (minus the popped return address). So we need to push the return address onto the stack before we set the saved RSP. This is what we did earlier.

Thus, fiber_entry can be a normal C function:

void fiber_entry() {
    fiber_t *fib = current;
    fib->func(fib->arg);
    fiber_exit();
}

Now the question of stack alignment: The C function fiber_entry expects a properly aligned stack. When we ret into it, RSP is incremented by 8 from the saved RSP. So if we saved RSP pointing just below the return address, then after ret, RSP points to the next stack location. That’s the default behavior. To ensure alignment, we need to make sure the saved RSP is aligned such that after adding 8 (pop of return address), RSP is 16-byte aligned. That means saved RSP should be (0 mod 16) - 8? Actually, let’s define: saved RSP value is X. Then after ret, RSP = X + 8. We want X+8 ≡ 0 (mod 16) → X ≡ 8 (mod 16). So the saved RSP should be 8 mod 16. Therefore, in ctx_init, we set the initial stack pointer to an address that is 8 mod 16. For a stack that grows downward, we can adjust accordingly. For simplicity, in our example we can ignore alignment and just assume it works on x86_64 without SSE vectorized functions. Many programs do this.

Let’s now complete the example with a simple test.

Test program:

#include "fiber.h"
#include <stdio.h>

void task1(void *arg) {
    int id = (int)(long)arg;
    for (int i = 0; i < 3; i++) {
        printf("Fiber %d: iteration %d\n", id, i);
        fiber_yield();
    }
    printf("Fiber %d: done\n", id);
}

int main() {
    fiber_init();
    fiber_spawn(task1, (void*)1, 4096);
    fiber_spawn(task1, (void*)2, 4096);
    fiber_run();
    return 0;
}

Compile with gcc: gcc -o test_fiber test.c fiber.c fibers.asm (requires linking .o from asm). Run and you’ll see interleaved output. This is cooperative multitasking in action.

Part 4: Adding Synchronization Primitives

Now that we have fibers that can yield, we need to coordinate them. The classic primitives are mutexes and condition variables. In a user-space threading library, these are implemented without kernel calls—just atomics and the scheduler.

4.1 A User-Level Mutex

A mutex has two states: locked and unlocked. We need to support blocking: if a fiber tries to lock a mutex that is already locked, it should block (not spin), meaning it yields control and is only rescheduled when the mutex becomes available. The scheduler must maintain a queue of fibers blocked on each mutex.

We can implement a simple mutex using a flag and a queue of FCBs. The lock operation:

typedef struct {
    int locked;
    fiber_t *waiting_head;
    fiber_t *waiting_tail;
} fiber_mutex_t;

void fiber_mutex_init(fiber_mutex_t *mtx) {
    mtx->locked = 0;
    mtx->waiting_head = mtx->waiting_tail = NULL;
}

void fiber_mutex_lock(fiber_mutex_t *mtx) {
    // Disable preemption? Not needed if cooperative.
    if (mtx->locked) {
        // Current fiber must block
        fiber_t *fib = current;
        fib->state = BLOCKED;
        // Add to waiting queue
        fib->next = NULL;
        if (mtx->waiting_tail) {
            mtx->waiting_tail->next = fib;
            mtx->waiting_tail = fib;
        } else {
            mtx->waiting_head = mtx->waiting_tail = fib;
        }
        // Yield to scheduler
        fiber_yield();
        // When we resume, we have the lock (the unlocking fiber gave it to us)
        // No need to set locked again; unlock will have set it and moved us to ready queue.
        // However, note that the scheduler will have put us back on ready queue, not done automatically.
        // We'll need to modify the unlock to move a waiting fiber to ready queue and lock it.
        // For now assume unlock handles it.
    } else {
        mtx->locked = 1;
    }
}

void fiber_mutex_unlock(fiber_mutex_t *mtx) {
    if (mtx->waiting_head) {
        // Dequeue a waiting fiber
        fiber_t *fib = mtx->waiting_head;
        mtx->waiting_head = fib->next;
        if (mtx->waiting_head == NULL) mtx->waiting_tail = NULL;
        fib->state = READY;
        // Add to ready queue (scheduler's queue)
        // We need access to the scheduler's ready queue. Better to have a global function.
        // enqueue(fib);
        // Then yield? Actually we can just unlock and let the scheduler handle later.
        // But we must ensure the lock is transferred: the waiting fiber assumes it now holds the lock.
        // So we set locked to 1? If we unlock, we set locked=0, but then the waiting fiber will try to lock again.
        // Instead, we can just move the waiting fiber to ready queue with the understanding that when it runs,
        // it will set locked. To avoid race, we can set locked=1 and let the waiting fiber clear it? No.
        // Simpler: the waiting fiber, when it wakes, should not try to lock again; it should assume lock is held.
        // So in unlock, we don't change locked flag; we just wake a waiter, which will then set locked = 1? Actually if we keep locked=1, the lock remains taken? The waiter becomes the new holder.
        // So we leave locked=1, and the waiter resumes directly (does not need to acquire again).
        // But we need to ensure that the scheduler puts the waiter into the ready queue; then when it runs, it will be inside fiber_mutex_lock after the yield, and since we didn't set locked=0, it will just return (because locked is still 1? Wait, the waiter blocked because locked was 1. When it wakes, it's still 1. It will loop and block again.
        // So we need to set locked=0 before waking, but then the waiter will try to lock and succeed, but there's a risk that another fiber could grab it in between. Since cooperative, no other fiber runs until we yield. So we can set locked=0, wake the waiter, and then the waiter's loop will see locked=0 and set it to 1.
        // Better: use a condition variable pattern: the lock is a flag, waiters go to sleep. Unlock sets flag=0 and wakes one waiter. The waiter then attempts to acquire again (spins until flag=0, then sets it). Since cooperative, it's safe.
        // Let's implement that.
    }
    mtx->locked = 0;
}

But we need a way to wake a fiber: we need a global function scheduler_enqueue(fiber_t *fib) that adds the fiber to the ready queue. Let’s add that to the scheduler.

Modify fiber.c to expose a function:

// In fiber.h
void fiber_enqueue(fiber_t *fib);

// Implementation
void fiber_enqueue(fiber_t *fib) {
    fib->next = NULL;
    if (ready_tail) {
        ready_tail->next = fib;
        ready_tail = fib;
    } else {
        ready_head = ready_tail = fib;
    }
}

Now the mutex unlock:

void fiber_mutex_unlock(fiber_mutex_t *mtx) {
    if (mtx->waiting_head) {
        fiber_t *fib = mtx->waiting_head;
        mtx->waiting_head = fib->next;
        if (mtx->waiting_head == NULL) mtx->waiting_tail = NULL;
        fib->state = READY;
        fiber_enqueue(fib);
    }
    mtx->locked = 0;
}

And the lock:

void fiber_mutex_lock(fiber_mutex_t *mtx) {
    while (mtx->locked) {
        // Block
        fiber_t *fib = current;
        fib->state = BLOCKED;
        fib->next = NULL;
        if (mtx->waiting_tail) {
            mtx->waiting_tail->next = fib;
            mtx->waiting_tail = fib;
        } else {
            mtx->waiting_head = mtx->waiting_tail = fib;
        }
        fiber_yield();
        // After wake, we will be ready. Loop to try again.
    }
    mtx->locked = 1;
}

Since cooperative, the while loop will not cause busy wait because we yield inside the loop each time. The only way to proceed is if another fiber unlocks the mutex and wakes us. This is correct.

4.2 Condition Variables

Condition variables allow fibers to wait for a condition. The typical API: wait(mutex, cond) releases the mutex, sleeps, and upon wake reacquires the mutex. signal(cond) wakes one waiter. broadcast(cond) wakes all.

Implementing a condition variable is straightforward: a queue of blocked fibers. wait adds the current fiber to the cond’s queue, releases the mutex (unlock), yields, then after wake, locks the mutex again. signal moves one fiber from the cond queue to the ready queue. Since we have a cooperative environment, there are no spurious wakeups (unless we allow interrupts, which we don’t).

We’ll need to modify the fiber scheduler to handle blocked fibers that are not in the mutex waiting queue but in a cond waiting queue. We can reuse the same blocking mechanism.

Implementation sketch:

typedef struct fiber_cond {
    fiber_t *waiting_head;
    fiber_t *waiting_tail;
} fiber_cond_t;

void fiber_cond_init(fiber_cond_t *cond) {
    cond->waiting_head = cond->waiting_tail = NULL;
}

void fiber_cond_wait(fiber_cond_t *cond, fiber_mutex_t *mutex) {
    // Release mutex
    fiber_mutex_unlock(mutex);
    // Block on condition
    fiber_t *fib = current;
    fib->state = BLOCKED;
    fib->next = NULL;
    if (cond->waiting_tail) {
        cond->waiting_tail->next = fib;
        cond->waiting_tail = fib;
    } else {
        cond->waiting_head = cond->waiting_tail = fib;
    }
    fiber_yield();
    // Reacquire mutex
    fiber_mutex_lock(mutex);
}

void fiber_cond_signal(fiber_cond_t *cond) {
    if (cond->waiting_head) {
        fiber_t *fib = cond->waiting_head;
        cond->waiting_head = fib->next;
        if (cond->waiting_head == NULL) cond->waiting_tail = NULL;
        fib->state = READY;
        fiber_enqueue(fib);
    }
}

void fiber_cond_broadcast(fiber_cond_t *cond) {
    while (cond->waiting_head) {
        fiber_cond_signal(cond); // dequeues one each time
    }
}

Now we have a complete set of synchronization primitives for cooperative user-level threading.

Part 5: Dealing with Blocking I/O

The Achilles’ heel of cooperative threading is blocking I/O. If a fiber calls read() on a socket that has no data, the kernel will block the entire process (or the kernel thread). This blocks all fibers. To handle this, we need to use non-blocking I/O combined with an event loop (select/poll/epoll). The fiber library must integrate with an I/O multiplexer. When a fiber issues a read that would block, it registers a callback with the event loop and yields. When the file descriptor becomes ready, the event loop wakes up the fiber.

This is exactly what modern libraries like libuv (used in Node.js) and Go’s runtime do. For our minimal library, we can provide wrappers for read, write, etc., that use non-blocking I/O and the fiber scheduler.

We would need to maintain a map from file descriptors to (fiber, operation). The event loop would be the scheduler’s main loop, using epoll_wait or select. The scheduler would interleave fiber execution with I/O polling. This is a significant extension but well beyond the scope of a minimal library. However, for educational completeness, we can outline the mechanism:

Before starting the scheduler, create an epoll fd.
Each fiber that wants to do I/O uses a function fiber_read(fd, buf, len) which first tries a non-blocking read. If it returns EAGAIN, the fiber registers itself with the epoll for that fd and yields. Later, when epoll_wait in the scheduler loop detects the fd, it moves the fiber back to ready queue.
The scheduler loop becomes: while (ready_queue not empty OR epoll has events) { run ready fibers; poll epoll; wake fibers whose fds are ready; }

This is the basis of asynchronous I/O driven by fibers.

Part 6: Performance and Real-World Considerations

6.1 How Fast Is It?

We can benchmark our library. On a modern x86_64 CPU, a round-trip yield (two context switches) using our custom assembly should be around 30–50 nanoseconds, compared to a kernel thread switch that might be 200–2000 ns. This speed is critical for applications handling millions of small tasks, such as network packet processing or server-side event handling.

6.2 Memory Overhead

Each kernel thread typically reserves 1–8 MB of stack space (overcommitted). Our fibers can use 4 KB stacks – that’s a 256x reduction in memory footprint. With 4 KB stacks, we can theoretically run over 250,000 fibers in 1 GB of RAM, whereas kernel threads would be limited to a few thousand.

6.3 Stack Overflow

Small stacks risk overflow. We could implement stack guards (guard pages) using mmap with PROT_NONE at the bottom of each stack, but that adds overhead. Alternatively, we can dynamically grow stacks similar to Go’s goroutines, which start with small stacks and expand as needed (using stack copying and pointer adjustment). That is complex but eliminates overflow.

6.4 Work Stealing and Multicore

Our library runs all fibers on a single kernel thread. To use multiple cores, we need multiple schedulers (each on its own kernel thread) with a work-stealing mechanism. This is how Go’s runtime works: it has a per-P (processor) run queue, and when a P is idle, it steals work from others. Building such a system adds another layer of complexity but is essential for performance on modern hardware.

6.5 Integration with Existing Code

User-level threads do not automatically make use of kernel threads for blocking calls. If you call into C libraries that do blocking I/O (e.g., database drivers), you must ensure they either use non-blocking I/O or run on separate kernel threads. Otherwise, you lose the benefit.

Part 7: Conclusion – The Illusion Made Real

We started with a street performer juggling three balls, and we ended with a fully functional user-level threading library. Along the way, we peeled back the layers of abstraction: we saw the true cost of kernel threads, we wrote assembly to swap register contexts, and we built a scheduler that coordinates thousands of lightweight tasks within a single kernel thread. We added mutexes and condition variables, and we briefly touched on the integration with asynchronous I/O.

User-level threading is not a relic of ancient operating systems—it is a vibrant technique used in modern high-performance systems. Go, Erlang, Haskell, Rust (async runtimes), and even C++ (boost.fiber, libco) rely on similar concepts. The illusion of many tasks running concurrently is maintained not by magic, but by careful orchestration of registers and stacks, all in user space.

If you’ve followed along and written the code, you now own a piece of that magic. You understand that concurrency, at its core, is just a sequence of fast hand-offs. And you have the power to craft your own juggling routine, finely tuned to the needs of your application.

Author’s Note: The complete source code for the library described in this article is available on GitHub at [example link]. It includes the assembly context switch, the scheduler, and the synchronization primitives. I encourage you to experiment with it—extend it to support multi-core, add a work-stealing scheduler, or integrate it with libevent. The journey of building your own concurrency machinery is both humbling and empowering.