Designing A Task Based Parallelism Runtime: Openmp Tasks, Dependencies, And Scheduling Algorithm

2022-11-12 · Leonardo Benicio

Summary

A comprehensive technical exploration of designing a task based parallelism runtime: openmp tasks, dependencies, and scheduling algorithm, covering key concepts, practical implementations, and real-world applications.

Contents

Here is a comprehensive, expanded technical blog post on designing a task-based parallelism runtime, building on the introduction you provided. I have extended it with the depth, examples, and rigor required for a full deep dive.

Introduction: The Death of the Free Lunch and the Rise of the Task

Imagine you are the lead engineer for a critical software system—perhaps the real-time path-planning module for a fleet of autonomous delivery drones, or the core rendering engine for the next blockbuster video game. Your system is stressed. Data is pouring in faster than ever. The clock is ticking. You look at your profiler and see a single, terrifying bottleneck: a monolithic loop processing a massive, irregular data structure. The solution, traditionally, would have been simple: wait for a faster CPU. But that free lunch ended two decades ago. Moore’s Law, as we knew it, stopped giving us faster single cores and started giving us more of them. Your quad-core machine is now a 16-core workstation, and your production server is a 128-core behemoth. Your single-threaded bottleneck doesn’t just run slowly; it actively wastes the vast majority of the hardware’s potential.

This is the fundamental crisis of modern software engineering. We are surrounded by parallel hardware, yet a staggering amount of code remains stubbornly, wastefully sequential. The promise of automatic parallelization—where a magical compiler transforms your sequential C++ into a perfectly balanced parallel program—has largely failed for anything beyond simple numeric loops. The complexity of modern software, with its pointer chasing, irregular control flow, and dynamic data structures, defies static analysis. The burden has fallen on the programmer, who is now forced to become a part-time systems engineer, wrestling with the sheer terror of threads, mutexes, condition variables, and the soul-crushing reality of deadlocks, data races, and non-deterministic Heisenbugs.

We have tried to build better mental models. Thread pools abstract away the creation and destruction of OS threads. Fork-join frameworks like OpenMP give us pragmas. Yet these tools feel like stone knives compared to the required surgical precision. They fail to capture the true structure of computation: a dynamic, irregular, unpredictable graph of work. This failure leads to load imbalance, serialization bottlenecks, and ultimately, waste.

The solution is the Task.

We are entering the era of the task-based parallelism runtime. Instead of wrangling threads, we ask the computer to consider our workload as a collection of discrete, composable units of work—tasks. We define the dependencies. We let a sophisticated runtime scheduler handle the mapping of millions of tiny tasks onto hundreds of hardware threads. The runtime steals work from idle cores. It manages cache locality. It handles the terrifying complexity of load balancing so you don’t have to.

Yet, for all its conceptual elegance, the path from the death of the free lunch to a robust, production-grade task runtime is paved with devilish details, subtle memory ordering races, and profound architectural trade-offs. If you are a systems engineer, a performance enthusiast, or just someone tired of watching your profiler report single-core usage on a 64-core machine, this deep dive is for you. We are going to build one.

Section II: Why Threads Failed Us – The Cathedral of Pain

Before we design the future, we must fully understand why the past is so broken. The direct use of OS threads (pthreads, Win32 threads, std::thread) is the foundational mistake that task runtimes correct.

The Weight of a Thread

An OS thread is not a lightweight entity. It is a kernel-managed scheduling unit. When you spawn a thread:

Kernel Call: The clone() or CreateThread() syscall is invoked.
Stack Allocation: A large contiguous stack region (typically 1–8 MB) is reserved. This is a massive virtual memory operation, even if backed lazily.
Kernel Data Structures: The OS creates thread control blocks, scheduling entries, and TLB context.
Scheduling Overhead: Every 4ms (default HZ=1000 scheduling tick) or upon blocking, the kernel makes a decision about which thread runs next.

The cost of creating a thread is on the order of microseconds (10,000+ ns). The cost of a context switch (switching between threads on the same core) is roughly 1–5 microseconds of wasted cycles. You cannot afford this for fine-grained work. A matrix multiply with a million elements cannot spawn a million threads.

The Scalability Ceiling (Amdahl’s Law Revisited)

Amdahl’s Law states that the speedup of a program is limited by the time fraction that must be executed sequentially. [ \text{Speedup}(P) = \frac{1}{(1 - F) + F/P} ] Where (F) is the fraction of work that is parallelizable, and (P) is the number of cores.

If your program is 95% parallel ((F=0.95)), increasing cores looks good initially:

4 cores: Speedup of 3.5x
16 cores: Speedup of 10.6x
64 cores: Speedup of 20.1x

You hit a wall. On a 64-core machine, you are using 95% of the cores, yet your speedup is only 20x. The 5% sequential bottleneck dominates.

Now consider the synchronization tax. Every mutex lock, every atomic operation adds to the sequential fraction. If your parallel region requires acquiring a single global lock (e.g., a shared queue), that lock becomes the bottleneck. Let’s modify Amdahl’s Law:

[ \text{Speedup}(P) = \frac{1}{(1 - F) + (F / P) + ( \text{Lock Overhead} \times P )} ]

If the lock contention grows linearly with the number of cores (which it often does), adding more cores eventually makes the program slower. This is the “Mutex Wall.”

The Debugging Hell (Heisenbugs)

Data races are undefined behavior in C and C++. The compiler is allowed to assume a race-free program. If you have a race, the program can:

Work on Linux debug.
Fail on Windows release.
Work in the debugger (Heisenbug effect) because the debugger changes timing.
Fail only under heavy load.

Traditional debugging is a loop of: Hypothesis -> Edit -> Compile -> Run -> Observe. With parallel bugs, the loop becomes: Guess -> Pray -> Compile -> Run on 128 cores -> Watch it crash randomly -> Curse the gods.

The fundamental issue is that threads force a 1:1 mapping between logical units of work and OS resources. We need an M:N model. Many millions of logical tasks (M) mapped onto a small number of OS threads (N). This is the holy grail that task runtimes provide. The runtime acts as a regulator, keeping exactly as many threads busy as there are hardware cores, while the programmer expresses the logical parallelism freely.

Why Thread Pools Are Not Enough

A thread pool (say, a fixed pool of std::thread reading from a std::queue) solves the creation overhead problem. But it introduces a terrible scalability bottleneck: the global queue lock.

Worker 1 wants a task. Locks queue.
Worker 2 wants a task. Waits on lock.
Worker 3 wants a task. Waits.
Worker 4 pushes a task. Waits on lock.

On a 64-core machine, this is a disaster. The cache line holding the mutex bounces between every core (cache coherency traffic). The lock lottery decides who works next.

We need a distributed, non-blocking scheduling strategy. We need Work Stealing.

Section III: The Task Abstraction – The Quantum of Computation

Let us define our core concept.

A Task is a closure. It is a function (functor, lambda) plus its captured state. In a runtime, a task is typically a struct or class with:

An executable run() method.
A set of dependencies (other tasks that must complete before this can start).
A continuation (what to run next).
A counter (reference count, dependency counter).

Crucially, a task is lightweight. It is not a thread. It has no kernel stack. It might have a small user-space stack (stackful coroutines like Go or Boost.Fiber) or no stack at all (stackless coroutines like C++20 co_await or Rust async).

The Task Dependency Graph (DAG)

Your program is no longer a sequential list of instructions. It is a Directed Acyclic Graph (DAG).

Nodes: Tasks.
Edges: Dependencies (A is a child of B, B cannot start until A finishes).

This is a powerful data structure. It is the blueprint for parallelism. If two tasks have no path between them in the DAG, they can be executed in parallel. If they have a path, they must be serialized.

The Work / Span Model (Cilk Model)

To reason about performance, computer scientists developed the Cilk model, which is the gold standard for task scheduling analysis.

Work ((T_1)): The total time to execute all tasks on a single core. This is the serial execution time.
Span ((T_\infty)): The length of the critical path. This is the minimum time required to execute the program on an infinite number of cores. (The longest chain of dependencies).

A classic result is the Cilk speedup bound: [ TP \le T_1/P + T\infty ]

This is much better than Amdahl’s Law for most cases! It says that if you have enough parallelism ((T1) is much larger than (T\infty)), you can achieve near-linear speedup.

Example: Parallel QuickSort

# Sequential partition, parallel recursion
def quicksort(A, low, high):
    if low < high:
        pivot = partition(A, low, high) # Sequential step
        # TASK: Sort left
        spawn quicksort(A, low, pivot - 1)
        # TASK: Sort right
        spawn quicksort(A, pivot + 1, high)
        # Wait for both tasks
        sync

The DAG looks like a binary tree. The Work (T1) is (O(N \log N)). The Span (T\infty) is the height of the tree, (O(\log N * \text{(cost of partition)})).

If (N = 1,000,000) and (\log N = 20), the parallelism ((T1 / T\infty)) is roughly (N / \log N = 50,000). On a 128-core machine, we have enormous amounts of parallelism to exploit. The runtime has plenty of work to keep every core busy.

Granularity: The Price of a Task

What is the overhead of spawning a task?

Thread spawn: ~10,000 ns (too high).
Task spawn: ~10-100 ns (in C++ with a lock-free queue).

If your task does 1 nanosecond of work, the spawn overhead is 10000%. This is too fine-grained. You must break work into chunks large enough to amortize the overhead. A good rule of thumb is that a task should do at least 1 microsecond of work.

Cilk uses “stubs,” “continuations,” and “inlining.”
Rayon uses “par_iter” which chunks arrays automatically.
Intel TBB uses a “range” splitting approach.

The runtime must be aware of granularity and automatically cut off splitting when the task size is too small (a sequential cut-off).

Section IV: The Heart of the Machine – Work Stealing

This is the crown jewel of task scheduling. It is the mechanism that allows us to achieve the theoretical Cilk bound in practice.

The Concept

Every worker thread (OS thread) owns a double-ended queue (deque) of tasks assigned to it.

Private Work (LIFO): When a worker spawns a task or pushes a new task, it pushes it to the bottom of its own deque. When it is idle, it pops a task from the bottom of its own deque. This is Last-In-First-Out. Why LIFO? Cache Locality. The most recently spawned task is likely working on data still hot in the L1 cache.
Theft Work (FIFO): If a worker’s own deque is empty, it becomes a thief. It randomly selects a victim worker and tries to steal a task from the top of the victim’s deque. This is First-In-First-Out. Why FIFO? The oldest tasks at the top are the largest “chunks” of work. Stealing large chunks reduces the number of steals needed and improves load balance.

The Chase-Lev Deque

This is the classic lock-free data structure that implements Work Stealing. It must be thread-safe (pop and push by the owner; steal by anyone) without heavy locks. We will design a simplified version. It is a cyclic buffer.

template<typename T>
class WorkStealingDeque {
    // Aligned to prevent false sharing!
    alignas(64) std::atomic<int64_t> top;   // Index of the top element (accessed by thieves).
    alignas(64) std::atomic<int64_t> bottom; // Index of the first *free* slot (accessed by owner).
    std::vector<std::atomic<T*>> buffer;    // The ring buffer.
    int capacity;

public:
    WorkStealingDeque(int cap) : top(0), bottom(0), capacity(cap), buffer(cap) {}

    void push(T* task) {
        int64_t b = bottom.load(std::memory_order_relaxed);
        int64_t t = top.load(std::memory_order_acquire);
        int64_t size = b - t;

        // Check for resize (simplified, usually grows exponentially)
        if (size >= capacity - 1) {
            // Resize logic (locks, not shown for brevity)
        }

        buffer[b % capacity].store(task, std::memory_order_relaxed);
        // Release: ensures the write to buffer is visible before bottom update.
        bottom.store(b + 1, std::memory_order_release);
    }

    T* try_pop() {
        // Owner tries to pop from the bottom.
        int64_t b = bottom.load(std::memory_order_relaxed) - 1;
        bottom.store(b, std::memory_order_relaxed); // Tentative decrement.
        // Full memory barrier to prevent reordering with the load of top.
        std::atomic_thread_fence(std::memory_order_seq_cst);

        int64_t t = top.load(std::memory_order_relaxed);
        T* task = nullptr;

        if (t <= b) {
            // Queue is non-empty (or was when we started).
            task = buffer[b % capacity].load(std::memory_order_relaxed);
            // Check if this was the last element (potential race with steal).
            if (t == b) {
                // The deque became empty. We must handle the race.
                // This is the "Danger Zone".
                // If a thief concurrently stole the *only* element, we must not take it.
                // We CAS top from t to t+1. If it fails, the thief won.
                if (!top.compare_exchange_strong(t, t + 1,
                                                 std::memory_order_acq_rel,
                                                 std::memory_order_relaxed)) {
                    // The thief won. Our pop is invalid.
                    task = nullptr;
                }
                // Regardless, the deque is now empty. Set bottom = top.
                bottom.store(t + 1, std::memory_order_release);
            }
        } else {
            // Queue was already empty before our pop.
            bottom.store(b + 1, std::memory_order_relaxed); // Restore bottom.
            task = nullptr;
        }
        return task;
    }

    T* try_steal() {
        // Thief tries to steal from the top.
        int64_t t = top.load(std::memory_order_acquire);
        // Acquire load for bottom: we must see the latest bottom.
        int64_t b = bottom.load(std::memory_order_acquire);

        T* task = nullptr;
        if (t < b) {
            // Non-empty queue.
            task = buffer[t % capacity].load(std::memory_order_relaxed);
            // CAS to try to reserve this element.
            // "Can I take the task at index t?"
            if (!top.compare_exchange_strong(t, t + 1,
                                             std::memory_order_acq_rel,
                                             std::memory_order_relaxed)) {
                // Failed: another thief got it or owner grabbed it.
                return nullptr;
            }
            // Success! We own this task.
        }
        return task;
    }
};

The “Danger Zone” Explain

Why is the seq_cst fence in try_pop so important?

Imagine a deque with exactly one element.

Owner (Pop): bottom = 1. It decrements to 0. b=0, t=0.
Thief (Steal): It reads t=0, b=1 (before owner decremented!). It thinks the queue is non-empty.
Race: The thief tries to CAS top from 0 to 1. The owner sees t == b (both 0), and tries to CAS top from 0 to 1.

Without the seq_cst fence in pop, the write to top in the thief and the write in the owner might be reordered by the CPU. The seq_cst fence forces a total order on these operations. It ensures that the thief’s load of bottom and the owner’s store to bottom are properly synchronized. If the fence is not seq_cst, the thief might see a stale bottom (old value 1) and access an invalid element, or the owner might think the queue is empty when it isn’t.

The ABA Problem

The indices top and bottom wrap around the ring buffer. A thief might read top = 0, get descheduled, wake up, and see top = 0 again, but the elements have completely changed! This is the classic ABA problem.

Solution: Tagged Pointers / 64-bit Tags. We store top and bottom as 64-bit integers. The lower bits are the index into the buffer. The high bits are a tag. Every time top is modified, the tag increments. This means that even if the index wraps around, the tag is different. Our compare_exchange_strong on top now checks the tag. If another steal modified the tag, our CAS fails.

struct alignas(64) DequeState {
    std::atomic<uint64_t> top; // [Tag:32 | Index:32]
    // ...
};
uint64_t top_val = top.load();
uint64_t new_top = (top_val + (1ULL << 32)) + 1; // Increment tag by 1, increment index by 1.
top.compare_exchange_strong(top_val, new_top);

Victim Selection

How does a thief choose whom to steal from?

Random: Cilk and TBB use random or pseudo-random victim selection. The theory says this is optimal for load balancing in the average case. It is simple and decentralized.
Hierarchical (NUMA-aware): A thief first steals from workers on its own NUMA node. If none are available, it steals from a remote node. This is critical for performance because remote memory access (QPI/UPI link) is much slower than local memory.

Work-First vs. Help-First Scheduling

Work-First (Cilk): The parent task continues immediately after spawning a child. The child is “stolen” by an idle worker. This has the lowest overhead for the worker, but creates more “steal” opportunities.
Help-First (TBB, Rayon): The parent task is “pushed” onto the deque. The child runs immediately. Idle workers steal the parent (or other siblings). This is often better for cache behavior.

Most modern runtimes use a hybrid approach.

Section V: Building the Runtime – A Skeleton

Let’s piece together a minimal C++ runtime. This is heavily simplified, but captures the essence.

#include <atomic>
#include <thread>
#include <vector>
#include <functional>
#include <random>

class Task {
public:
    std::atomic<uint32_t> ref_count{1};
    Task* parent = nullptr;
    virtual void execute() = 0;
    virtual ~Task() = default;
};

class LambdaTask : public Task {
    std::function<void()> fn;
public:
    LambdaTask(std::function<void()> f) : fn(std::move(f)) {}
    void execute() override { fn(); }
};

class Scheduler {
    struct WorkerContext {
        WorkStealingDeque<Task> deque;
        std::mt19937 rng;
        WorkerContext() : deque(256), rng(std::random_device{}()) {}
    };
    std::vector<std::thread> workers;
    std::vector<std::unique_ptr<WorkerContext>> contexts;
    std::atomic<bool> done{false};
    int thread_count;

public:
    Scheduler(int num_threads = std::thread::hardware_concurrency())
        : thread_count(num_threads) {
        for (int i = 0; i < num_threads; ++i)
            contexts.push_back(std::make_unique<WorkerContext>());

        for (int i = 0; i < num_threads; ++i) {
            workers.emplace_back([this, i] { worker_loop(i); });
        }
    }

    ~Scheduler() {
        done.store(true, std::memory_order_release);
        for (auto& w : workers) w.join();
    }

    int get_thread_id() {
        // Thread-local storage for thread id.
        static thread_local int tls_id = -1;
        return tls_id;
    }

    void spawn(Task* t) {
        int tid = get_thread_id();
        contexts[tid]->deque.push(t);
    }

    void spawn_detached(std::function<void()> fn) {
        auto* task = new LambdaTask(std::move(fn));
        spawn(task);
    }

    // Spawn a child task. The parent will wait for it.
    void spawn_child(Task* parent, Task* child) {
        child->parent = parent;
        parent->ref_count.fetch_add(1, std::memory_order_acq_rel);
        spawn(child);
    }

    void wait_for_task(Task* task) {
        // Work-stealing wait!
        // Instead of blocking the thread, we help the runtime.
        while (task->ref_count.load(std::memory_order_acquire) != 0) {
            int tid = get_thread_id();
            Task* work = contexts[tid]->deque.try_pop();
            if (work) {
                work->execute();
                complete(work);
            } else {
                // Steal from a random victim.
                std::uniform_int_distribution<int> dist(0, thread_count - 1);
                int victim = dist(contexts[tid]->rng);
                if (victim == tid) continue;

                Task* stolen = contexts[victim]->deque.try_steal();
                if (stolen) {
                    stolen->execute();
                    complete(stolen);
                }
                // If no work, yield to OS (or spin briefly).
                std::this_thread::yield();
            }
        }
    }

    void complete(Task* task) {
        if (task->parent) {
            // Decrement parent ref count.
            // If it reaches 0, the parent is unblocked (schedule it).
            if (task->parent->ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                // Parent is ready to run again.
                spawn(task->parent);
            }
        }
        delete task; // Danger: memory management simplified.
    }

    // Sync point for a scope!
    struct Scope {
        Scheduler& sched;
        std::atomic<int> counter{1}; // Start at 1 to prevent premature exit.

        void spawn_child_task(Task* t) {
            counter.fetch_add(1, std::memory_order_acq_rel);
            sched.spawn_child(this, t); // This is wrong, parent is Scope, but simplified.
            // In reality, Scope has a list of parents or tasks.
        }
        void wait() {
            // Works like a barrier.
        }
    };

    void worker_loop(int tid) {
        // Set thread-local id.
        // (Requires compiler-specific thread_local).
        // For brevity, assume we have a thread_local int my_id="i9616b99"
        while (!done.load(std::memory_order_acquire)) {
            Task* task = contexts[tid]->deque.try_pop();
            if (task) {
                task->execute();
                complete(task);
            } else {
                // Steal...
                // ...
                if (!task) {
                    // Spin a bit, then yield.
                    std::this_thread::yield();
                }
            }
        }
    }
};

// Usage Example
void parallel_example(Scheduler& sched) {
    // Create a root task.
    auto* root = new LambdaTask([]{
        // This is the root of the DAG.
    });
    // Spawn children...
    sched.wait_for_task(root);
}

The complete() Problem

Imagine a binary tree workload like QuickSort. If we spawn tasks A, B, C…

A creates B and C.
B finishes. It decrements A’s counter.
C finishes. It decrements A’s counter.
When A’s counter hits 0, we re-insert A into the deque.

What if A finishes very late? The runtime needs a mechanism to handle the continuation efficiently. Cilk’s solution was to invert the task. Instead of the parent waiting for the child, the child, upon completion, continues the parent. This is “continuation stealing.”

Continuations (The Real Key)

Continuations are how you write non-blocking task code.

// Instead of:
future<int> f1 = spawn(do_work_a());
future<int> f2 = spawn(do_work_b(f1.get()));

// We write:
spawn(do_work_a())
    .then([](int result_a) { return do_work_b(result_a); })
    .then([](int result_b) { std::cout << result_b; });

When do_work_a completes, it doesn’t block. It schedules the continuation task (do_work_b) onto the runtime. This is the foundation of modern async runtimes (Tokio, std::execution, Swift Actors).

Section VI: Advanced Considerations – The Devil’s Workshop

NUMA Awareness

The Non-Uniform Memory Access architecture is the silent killer of performance. A thread on Socket 0 accessing memory allocated on Socket 1 pays a heavy penalty (latency and bandwidth).

Memory Allocation: Use libnuma to allocate memory on the same node as the worker. This is often called First-Touch Policy. The thread that first touches a page of memory will have it allocated on its local node.
Work Stealing: A worker should first try to steal from workers on its own NUMA node. Only if local queues are empty should it steal globally. This prevents threads from “migrating” their working set across the memory bus.
Task Affinity: Tag a task with its preferred NUMA node. The runtime attempts to execute it there.

I/O Integration (The Blocking Problem)

A task runtime is magical for CPU-bound work. But what about I/O?

Synchronous I/O (read/write): A thread calls read(). The thread blocks. The kernel deschedules it. The runtime has lost a core.
Blocking Mutex: A worker tries to lock a mutex. It spins, fails, sleeps. Another core is wasted.

The solution is Asynchronous I/O and Non-blocking synchronization.

Linux io_uring: The runtime submits a batch of I/O requests to the kernel. A dedicated thread or the runtime poller checks for completions. When a completion arrives, the runtime schedules the continuation task.
Tokio’s Model: A reactor runs the event loop (epoll/kqueue). It wakes up workers when I/O is ready. The runtime never blocks on I/O.
Go’s Netpoller: The Go runtime has a dedicated “network poller” thread that does the same thing. If a goroutine blocks on I/O, it is detached from the OS thread, and the thread picks up a new goroutine to run.

Debugging and Profiling Task Runtimes

You cannot gdb a task runtime. Debugging a race condition is nearly impossible. Instead, we use Tracing.

Tracy: A frame-based profiler that records every task start, stop, and lock event.
Chrome Tracing format: Output a JSON file of events. “Thread X executed Task Y from time A to time B.”
Intel VTune / AMD uProf: These tools understand parallel regions and can show you time spent in the scheduler vs. useful work.

Pitfalls

False Sharing: Two variables used by different threads on different cores reside on the same cache line. The cache line bounces back and forth. Solution: Pad your hot data structures to 64 bytes (alignas(64)).
Memory Allocation: The default malloc is a global lock. In a parallel runtime, this is a crippling bottleneck. Solution: Use jemalloc, tcmalloc, or a custom per-thread pool.
Generating Too Many Tasks: If you spawn a task for every element of an array, the overhead will dominate. Solution: Use a “stolen” cut-off or a library like Rayon that automatically divides work into chunks.
Blocking on a Mutex Inside a Task: This reduces the runtime’s M:N scheduling back to 1:1. If a worker must wait for a lock, the latency of the lock dominates.
- Solution: Use lock-free data structures. Or, use an async mutex (a mutex that yields the thread to other tasks while waiting).
Priority Inversion: A low-priority task holds a lock needed by a high-priority task. The high-priority task cannot run. Solution: Avoid explicit priorities in a uniform task runtime, or implement priority inheritance.

Section VII: Real-World Ecosystems – The Titans

Let’s look at how different platforms solve the same problem.

Feature	Intel TBB (C++)	Rayon (Rust)	Tokio (Rust)	Goroutines (Go)
Model	Task-based (Fork-Join)	Data Parallelism	Async I/O + Tasks	M:N User Threads
Scheduler	Work Stealing (Global + Local)	Work Stealing	Work Stealing (Tokio)	Work Stealing (GMP)
I/O	No built-in I/O integration	N/A (for data parallel)	Fully integrated (epoll/io_uring)	Fully integrated (netpoller)
Dependencies	Flow graph / `parallel_for`	`par_iter().map().filter()`	Manual via `select!` / channels	Channels (`chan`)
Blocking a Task	Spins / Blocks	N/A	Yields to reactor	Detaches P (Process) from M (OS Thread)
Stack	Flat (no stack)	Flat (no stack)	Flat (no stack)	Growable stack (2KB initial)
Key Strength	Generic, performance on CPU	Ergonomic data parallelism	Async I/O at scale	Simplicity of blocking I/O

Deep Dive: Goroutines vs. Tokio

This is a fascinating debate.

Go: “Don’t communicate by sharing memory; share memory by communicating.” A blocked goroutine (waiting on a channel, waiting on I/O, waiting on a syscall) is parked. The OS thread picks up another goroutine. This is incredibly ergonomic. The programmer writes synchronous-looking code that is actually asynchronous.
- Weakness: A goroutine has a stack (2KB, growable). The GC must handle this.
Tokio (Rust): “Zero-cost abstractions.” Tasks are just Future objects. They are state machines. They have no stack at all. Blocking is strictly forbidden.
- Weakness: The programmer must explicitly think about async/await. Everything has a “color” (sync vs. async). Dropping a future is cancelling it.

Deep Dive: C++20 Executors (P2300)

The std::execution proposal (Senders/Receivers) is the most ambitious attempt to standardize a task-based runtime for C++.

// Compose an async operation!
sender auto sndr = execution::just(42)
    | execution::then([](int x) { return x + 1; })
    | execution::let_value([](int x) {
          return execution::just(std::to_string(x));
      });
// Execute on a thread pool
auto result = std::this_thread::sync_wait(std::move(sndr));

This is powerful. You compose a DAG of operations dynamically. The runtime handles the scheduling and execution. It is the future of C++ parallelism.

Section VIII: Conclusion – The Future is a DAG

We have come a long way from the “Death of the Free Lunch.” We moved from raw OS threads, through thread pools, to the elegant abstraction of the Task.

A task-based runtime is not just a “better thread pool.” It is a fundamental shift in how we reason about computation.

We think in graphs, not lists.
We think in dependencies, not locks.
We let the scheduler handle load balancing, because it is provably better at it than any human.

The future of task runtimes is bright.

Heterogeneous Computing: We will see runtimes that manage CPUs, GPUs, and NPUs, transparently moving data and tasks between them (SYCL, Kokkos, Raja).
Resilience: Tasks will be retried on failure, migrated from failing nodes.
Composability: The C++ Executors model allows different schedulers (a GPU scheduler, a thread pool, a single-threaded executor) to be composed together in the same program through a standard API.
Compiler Integration: Compilers (like LLVM) are getting better at automatically generating task DAGs from sequential code using static analysis and polyhedral models.

The free lunch of faster CPUs is long gone. The free lunch of automatic parallelism was a myth. The only real meal on the table is the one we engineer ourselves. The task-based runtime is the knife and fork. Learn to wield it well, because the future of software depends on our ability to think in parallel. Go forth and schedule.