The Formal Verification Of A Concurrent Work Queue Using The Tla+ Model Checker: Invariants And Liveness

2020-08-10 · Leonardo Benicio
Summary

A comprehensive technical exploration of the formal verification of a concurrent work queue using the tla+ model checker: invariants and liveness, covering key concepts, practical implementations, and real-world applications.

Contents

Beyond the Bug: Formal Verification of a Concurrent Work Queue with TLA+

Introduction (Expanded)

It was 3:47 AM. The on-call phone buzzed on the nightstand, dragging me from a deep sleep into that familiar haze of adrenaline and dread. A critical production service was down. The symptom was maddeningly vague: a “work queue stall.” Consumers had stopped processing tasks. They weren’t crashing; they were simply waiting. The logs were clean. The alerts were silent. The queue had tasks, but the workers were asleep.

After two hours of frantic debugging, we found it. A single thread, deep within the queue’s internal state management, had returned a null head pointer when it shouldn’t have. It was a classic “check‑then‑act” race condition, masked by the 99.9th percentile of latency. It happened precisely when a queue drain coincided with a concurrent producer push. We fixed it with a single synchronized block. We wrote a regression test. We deployed. The phone stayed quiet.

But I never slept well again. That single line fix haunted me. How many other races were lurking in that codebase, waiting for the exact constellation of thread interleavings to manifest? How could I prove, with absolute certainty, that the queue behavior was correct in every possible scenario? Unit tests offered a flimsy safety net. Stress tests revealed only what they could reproduce. I needed a better oracle.

This is the fundamental crisis of concurrent software. We, as engineers, are expected to build systems that juggle a million actions a second, yet we validate them using tools designed for a single‑threaded world. Race conditions, deadlocks, and livelocks are the Schrödinger’s cat of software engineering: they exist until observed, and they are notoriously difficult to observe until it is too late. The cost of failure is not just a 3 AM page; in distributed systems, it is data corruption, financial loss, and erosion of trust.

In this post, I will walk you through how I used TLA⁺ — a formal specification language created by Leslie Lamport — to model, verify, and ultimately prove the correctness of a concurrent work queue. We’ll go far beyond “write a test and pray.” You will learn:

  • Why traditional testing is insufficient for concurrent systems.
  • How to model a concurrent work queue in TLA⁺ (including the queue’s internal state, producers, consumers, and a drain operation).
  • How to specify safety and liveness properties.
  • How to run model checking with TLC to exhaustively explore all possible interleavings.
  • How to refine your model step by step until it matches your implementation.
  • Real‑world pitfalls we discovered during the process.
  • How to translate a verified TLA⁺ specification into production‑grade code.

By the end, you’ll understand why formal verification is not an esoteric academic exercise but a practical, life‑saving tool for any engineer building concurrent or distributed systems. Let’s begin.

1. The Anatomy of a Concurrent Work Queue

Before we dive into formal logic, let’s define exactly what we mean by a “concurrent work queue.” In a typical microservices or background‑job system, you have:

  • Producers: threads that enqueue tasks.
  • Consumers: threads that dequeue and process tasks.
  • A central queue data structure, often backed by an array or linked list, that must be thread‑safe.

Additionally, there is often a drain operation: a way to remove all pending tasks (e.g., during graceful shutdown or a priority reset). The drain must be atomic with respect to concurrent enqueue and dequeue operations.

The queue we’ll model is a simplified version of what plagued our production system:

  • It supports three operations: Enqueue(task), Dequeue() → task | None, and Drain() → list of tasks.
  • Multiple producers and multiple consumers can call these concurrently.
  • Correctness means:
    1. Safety (nothing bad ever happens): no task is lost, no task is duplicated, drain returns exactly the set of tasks that were in the queue at the moment of drain, and no consumer ever receives null when the queue is non‑empty.
    2. Liveness (something good eventually happens): every enqueued task is eventually dequeued (unless drained) and drain eventually completes.

These properties are notoriously hard to test. A unit test can check one thread; a stress test can run many threads, but the timing of interleavings is non‑deterministic. A bug might only appear after hours of runtime or under a specific load pattern. Formal verification with TLA⁺ eliminates this uncertainty by exhaustively enumerating all possible behaviors of the system within a bounded state space.

2. Why TLA⁺?

TLA⁺ stands for Temporal Logic of Actions. It is a language for specifying systems — especially concurrent and distributed ones — in a way that can be mechanically checked. Unlike unit tests that sample a few execution paths, TLA⁺ models are checked by a model checker (TLC) that explores every reachable state. If a property holds, it holds for all possible executions. If it fails, TLC produces an explicit counterexample trace – a sequence of steps that leads to the violation. That trace is golden for debugging.

TLA⁺ is built on simple mathematical foundations: sets, functions, and actions (state transitions). You describe the system as a state machine, specify invariants and temporal properties, and let TLC do the heavy lifting.

2.1 A Quick Primer

A TLA⁺ specification has three main components:

  1. State variables – the mutable state of the system (e.g., the queue’s contents).
  2. Initial predicate Init – the starting state.
  3. Next‑state relation Next – a disjunction of actions, where each action defines how one or more variables can change.

A behavior is an infinite sequence of states where the first satisfies Init and each consecutive pair satisfies Next. Model checking explores all (finite prefixes of) behaviors up to some bound.

We also write invariants (properties that must be true in every state) and temporal formulas (properties that must be true over entire behaviors, such as “every task is eventually processed”).

TLC works by enumerating all reachable states. The number of states can explode, but for small‑to‑medium concurrent models (like a work queue with a few threads and a bounded number of tasks) it is tractable. The key is to keep the model abstract enough to capture the essence of concurrency while not over‑specifying implementation details.

3. Modeling a Concurrent Work Queue in TLA⁺

Let’s build our model step by step. We’ll start with a very abstract version and refine it.

3.1 First Model: A Simple Sequential Queue

We begin by modeling the queue as a sequence of tasks. We’ll use TLA⁺ standard module Sequences.

-------------------------- MODULE WorkQueue --------------------------
EXTENDS Integers, Sequences, TLC

CONSTANT TaskID         \* The set of possible task identifiers
VARIABLE queue          \* The sequence of pending tasks

Init == queue = << >>  \* Empty queue

Enqueue(task) ==
    queue' = Append(queue, task)

Dequeue ==
    IF queue = << >>
    THEN UNCHANGED queue
    ELSE queue' = Tail(queue)

Next ==
    \E task \in TaskID : Enqueue(task)   \* Producer actions
    \/ Dequeue                           \* Consumer actions

This is trivial: only one action at a time (no concurrency). Next is a disjunction of actions, but each action is atomic. To introduce concurrency we need to model multiple threads acting simultaneously. In TLA⁺ we do this by allowing multiple actions to be interleaved; the model checker will consider all possible orders.

3.2 Introducing Concurrency: Multiple Producers and Consumers

We can model each thread as a separate process. TLA⁺ has no built‑in process construct, but we can simulate with a variable that records which thread is “active” or we can use a set of actions labeled by thread ID. For clarity, we’ll use the Fairness module and write a separate Next action for each thread. However, the standard TLA⁺ way is to have a single Next that non‑deterministically picks any thread and any legal action.

Let’s define constants for the number of producers and consumers.

CONSTANTS
    Producers,
    Consumers,
    TaskID

VARIABLES
    queue,           \* Sequence of tasks
    pc,              \* Program counter for each thread (not needed yet)
    pendingTasks     \* A set of tasks that have been enqueued but not yet dequeued/drained

\* For simplicity we track only the queue sequence.
\* We'll add more variables later for correctness properties.

Now we model the actions. Each action is parameterized by the thread index.

Enqueue(producer) ==
    \E t \in TaskID :
        /\ queue' = Append(queue, t)

Dequeue(consumer) ==
    IF queue = << >>
    THEN UNCHANGED queue
    ELSE queue' = Tail(queue)

Drain ==
    queue' = << >>

But in a true concurrent setting, Drain must atomically capture the entire queue and clear it. Our Drain above is fine as an atomic action. The problem is that Enqueue, Dequeue, and Drain can interleave arbitrarily. We need to model the fact that multiple threads can attempt to act at the same time, but the system as a whole takes one action per step (interleaving semantics). That is already what TLA⁺ does: Next is a disjunction of all possible actions from all threads.

Next ==
    \E p \in 1..Producers : Enqueue(p)
    \/ \E c \in 1..Consumers : Dequeue(c)
    \/ Drain

Now we have a model where any producer can enqueue any task, any consumer can dequeue, and a drain can occur. But this is still too coarse: enqueue always succeeds, dequeue never blocks (it returns empty when queue empty), and drain just empties the queue regardless of state. We haven’t modeled the internal implementation – e.g., the queue might be implemented as a linked list with a head pointer that can cause null pointer dereference under concurrent access.

To discover the race condition we encountered, we need to refine the model to reflect a plausible implementation that is not atomic. The bug happened because the implementation had a non‑atomic read‑modify‑write of the head pointer. So we must model the internal data structure and the low‑level operations.

3.3 Modeling a Linked‑List Queue with Non‑Atomic Operations

Let’s assume the queue is implemented as a singly linked list with two pointers: head (pointing to the first node) and tail (pointing to the last node). head = NULL when the queue is empty. A node contains a task and a next pointer.

In TLA⁺ we can model memory as a function from node addresses to node records. But to keep the model small we can abstract: represent the queue as a sequence of tasks, but simulate non‑atomicity by splitting each operation into multiple micro‑steps. For example, Dequeue might involve:

  1. Read head.
  2. If head != NULL, read the task from that node.
  3. Write head = head.next.
  4. Return the task.

If a concurrent Drain happens after step 2 but before step 3, the Drain could set head = NULL and then the Dequeue writes head to the old head.next, effectively resurrecting a node that should have been drained – a classic race.

To model this, we need to introduce a pc (program counter) per thread and break actions into atomic steps.

Let’s create a more detailed specification.

CONSTANTS
    Producers,
    Consumers,
    Null              \* used as a sentinel

VARIABLES
    head,             \* pointer to first node or Null
    tail,             \* pointer to last node or Null
    nodes,            \* function from node address to [task |-> ..., next |-> ...]
    free,             \* set of available node addresses (for allocation)
    pc_p,             \* program counter for each producer
    pc_c              \* program counter for each consumer
    \* ... more state as needed

This quickly becomes complex. Instead, we can use a higher‑level but still non‑atomic model: we keep the queue as a sequence, but we treat each operation as a sequence of two “micro‑actions”: a read and a write. This abstraction captures the essence of the race without modeling every pointer.

For instance, consider a Dequeue micro‑step:

  • DequeueRead(consumer): reads the current state of queue and stores it in a local variable (implicitly).
  • DequeueWrite(consumer): modifies queue based on the previously read value.

If a Drain writes between the read and the write, the consumer might write based on stale data.

Let’s model:

VARIABLES
    queue,              \* sequence of tasks
    consumerBuffer[1..Consumers],  \* holds the value read by each consumer
    producerBuffer[1..Producers]   \* similar
    drainFlag

Init ==
    /\ queue = << >>
    /\ \A c \in 1..Consumers : consumerBuffer[c] = "idle"
    /\ \A p \in 1..Producers : producerBuffer[p] = "idle"
    /\ drainFlag = FALSE

\* Producer actions
ProducerRead(p) ==
    /\ producerBuffer[p] = "idle"
    /\ \E t \in TaskID :
        producerBuffer[p]' = t
    /\ UNCHANGED queue, consumerBuffer, drainFlag

ProducerWrite(p) ==
    /\ producerBuffer[p] # "idle"
    /\ queue' = Append(queue, producerBuffer[p])
    /\ producerBuffer[p]' = "idle"
    /\ UNCHANGED consumerBuffer, drainFlag

\* Consumer actions
ConsumerRead(c) ==
    /\ consumerBuffer[c] = "idle"
    /\ consumerBuffer[c]' = queue   \* take a snapshot
    /\ UNCHANGED queue, producerBuffer, drainFlag

ConsumerWrite(c) ==
    /\ consumerBuffer[c] # "idle"
    /\ consumerBuffer[c] # << >>   \* queue not empty in snapshot
    /\ queue' = Tail(consumerBuffer[c])   \* write the tail of the snapshot
    /\ consumerBuffer[c]' = "idle"
    /\ UNCHANGED producerBuffer, drainFlag

\* Drain action (atomic for now)
Drain ==
    /\ drainFlag = FALSE
    /\ queue' = << >>
    /\ drainFlag' = TRUE
    /\ UNCHANGED producerBuffer, consumerBuffer

Next ==
    \E p \in 1..Producers : ProducerRead(p) \/ ProducerWrite(p)
    \/ \E c \in 1..Consumers : ConsumerRead(c) \/ ConsumerWrite(c)
    \/ Drain

This model captures the essence: a consumer reads the entire queue, then later writes back Tail of that snapshot. If Drain happens between read and write, the write will set queue to the tail of a snapshot that no longer reflects the true state, potentially losing tasks that were added after the snapshot. Or if Drain clears the queue, then a consumer write that happens after Drain might revive a tail that includes tasks that should have been drained. This is exactly the kind of race we want to catch.

But note: our model has a bug already! In ConsumerWrite, we assumed the snapshot was non‑empty. What if the snapshot was empty? Then the consumer should not modify the queue. We need to allow a read of an empty queue to cause the consumer to do nothing. Let’s refine:

ConsumerWrite(c) ==
    /\ consumerBuffer[c] # "idle"
    /\ IF consumerBuffer[c] = << >>
       THEN queue' = queue   \* nothing to dequeue, leave unchanged
       ELSE queue' = Tail(consumerBuffer[c])
    /\ consumerBuffer[c]' = "idle"
    /\ UNCHANGED producerBuffer, drainFlag

Now we have a model where concurrency can cause subtle errors. Next we need to specify the correctness properties and run the model checker.

4. Specifying Correctness Properties

4.1 Safety: No Lost Tasks

We want to ensure that tasks are never lost. A task is lost if it is enqueued but never dequeued and not present in the queue at the end (under some fairness assumption). To check safety, we can use an invariant: the multiset of tasks that have been enqueued equals the multiset of tasks dequeued plus the tasks currently in the queue. But we also have Drain, which effectively removes tasks. So we need to distinguish between tasks that are drained (removed by drain) from those dequeued normally.

We can introduce a variable enqueuedTasks (a set of tasks that have been enqueued at some point) and dequeuedTasks (set of tasks removed by dequeue), and drainedTasks (set of tasks removed by drain). Then safety invariant:

Invariant ==
    \* All enqueued tasks are accounted for: either still in queue, or dequeued, or drained.
    \* (We assume tasks are unique  can enforce by using a set of TaskID)
    (enqueuedTasks \ (dequeuedTasks \cup drainedTasks)) \subseteq SetOfElements(queue)
    /\ SetOfElements(queue) \subseteq enqueuedTasks

But SetOfElements is a custom operator. Let’s define:

SetOfElements(seq) == { x \in TaskID : \E i \in 1..Len(seq) : seq[i] = x }

However, this invariant is tricky because duplicates can exist. It’s easier to assume all tasks are distinct (which is reasonable for a work queue, you can assign unique IDs). Then we can use a single counter: the number of enqueued minus dequeued minus drained equals length of queue.

VARIABLE countEnqueued, countDequeued, countDrained

Invariant ==
    (countEnqueued - countDequeued - countDrained) = Len(queue)

And update these counters in each action:

  • ProducerWrite: countEnqueued' = countEnqueued + 1
  • ConsumerWrite: if queue was non‑empty in snapshot, then countDequeued' = countDequeued + 1
  • Drain: countDrained' = countDrained + Len(queue)

This gives us an easy invariant to check.

4.2 Liveness: Tasks Eventually Get Processed

For liveness, we need to specify that if a task is enqueued and never drained, it will eventually be dequeued. In TLA⁺ we use temporal operators: [] (always), <> (eventually). The formula for liveness is:

Liveness ==
    \A t \in TaskID :
        ( (t \in enqueuedTasks) /\ ~(t \in drainedTasks) )
        => <>(t \in dequeuedTasks)

But this is not a property we can check with TLC for unbounded state space because it requires fairness assumptions about producers and consumers. We typically check liveness with fairness constraints: we assume that actions are weakly fair (if they are continuously enabled, they eventually happen). For the model checker, we need to bound the number of tasks and the number of steps, but we can still check for deadlocks or starvations.

4.3 Absence of Deadlocks and Stalls

A common check is that the system does not reach a deadlock (no action is enabled) unless it is the expected terminal state. For our model, we can specify a progress property: “if there is at least one task in the queue, then some consumer is able to dequeue it eventually.” But because the queue can be drained, we need to ensure that if drainage is not active, consumers can make progress.

For simplicity, we’ll focus on safety first.

5. Model Checking with TLC

Let’s feed our model into TLC. We create a configuration file that specifies constants and what to check.

Example TLC config (in WorkQueue.cfg):

CONSTANTS
    Producers = 2
    Consumers = 2
    TaskID = {t1, t2, t3}
    Null = NULLTOKEN   (we define a placeholder)

INVARIANT Invariant

Then run TLC:

tlc2 -config WorkQueue.cfg WorkQueue.tla

TLC will explore all reachable states. With small constants (2 producers, 2 consumers, 3 tasks) the state space is manageable. It will output whether the invariant holds.

5.1 Expected Counterexample

Given our non‑atomic model, TLC will find a violation of the invariant. For example, a trace:

  1. Producer 1 enqueues t1 (countEnqueued=1, queue=[t1]).
  2. Consumer 1 reads queue => snapshot [t1].
  3. Drain occurs (queue=[], countDrained=1).
  4. Consumer 1 writes Tail(snapshot) => queue = [] (since Tail([t1]) = []). This is correct: task t1 is now drained, countDequeued unchanged, queue empty. Invariant: (1 - 0 - 1) = 0 holds? Actually countEnqueued=1, countDrained=1, countDequeued=0, Len(queue)=0 -> 1-0-1=0 holds. That might seem correct. But what if the snapshot includes tasks that were never enqueued? Actually the snapshot was taken before drain, so it’s fine. Need a different interleaving.

Better example:

  1. Producer 1 enqueues t1 (queue=[t1], countEnqueued=1).
  2. Consumer 1 reads queue => snapshot [t1].
  3. Producer 2 enqueues t2 (queue=[t1, t2], countEnqueued=2).
  4. Consumer 1 writes Tail(snapshot) => queue = [] (because it only saw [t1]). Now t2 is lost! Invariant: countEnqueued=2, countDequeued=0, countDrained=0, Len(queue)=0 -> 2 != 0. So invariant fails. TLC will produce this trace.

This is a classic race: a consumer that reads the queue, gets a stale snapshot, and then writes back the tail of that snapshot, effectively overwriting newer enqueues. This exactly mirrors the kind of race we had in production (though our production race was about null head pointer, the effect was similar).

5.2 Fixing the Model – Adding Atomicity

The fix is to ensure that the dequeue operation is atomic at the level of one task removal. In the model, we can replace the read‑write pair with a single atomic action:

AtomicDequeue(c) ==
    /\ queue # << >>
    /\ queue' = Tail(queue)
    /\ countDequeued' = countDequeued + 1

And remove the read/write steps. Then the invariant should hold.

But the whole point of our post is that the real implementation was not atomic. So we need to model a real implementation that is atomic through proper locking. For a concurrent queue, the typical correct implementation uses a lock around every operation. In TLA⁺, we can model a lock as a variable lock that is acquired and released.

Let’s add a lock.

6. Lock‑Based Correct Implementation

We model a global lock that any action must acquire before modifying the queue.

VARIABLE lock          \* either 0 (unlocked) or thread ID

AcquireLock(thread) ==
    /\ lock = 0
    /\ lock' = thread
    /\ UNCHANGED queue, countEnqueued, countDequeued, countDrained

ReleaseLock(thread) ==
    /\ lock = thread
    /\ lock' = 0
    /\ UNCHANGED queue, countEnqueued, countDequeued, countDrained

EnqueueWithLock(p, t) ==
    /\ lock = 0
    /\ lock' = p
    /\ queue' = Append(queue, t)
    /\ countEnqueued' = countEnqueued + 1
    /\ lock''? We need two steps: acquire, enqueue, release. Better to do atomic with lock.

Instead, we model all three as one atomic action:

Enqueue(p) ==
    /\ lock = 0
    /\ \E t \in TaskID :
        queue' = Append(queue, t) /\
        countEnqueued' = countEnqueued + 1 /\
        lock' = 0   \* release lock implicitly

But that still allows interleaving between lock acquisition and release? No, we do it atomically: the action acquires lock, modifies queue, releases lock in one step. That's fine for model checking, but it hides the fact that the acquire and release are separate instructions. To reveal the race, we need to break them.

Let's keep the model with separate acquire and release, but then we must also ensure mutual exclusion: only one thread can hold the lock. TLC will find deadlocks if a thread acquires and then never releases.

Actually, for the lockbased model to be correct, we must ensure that threads always release the lock. In our first race model, we didn't have locks. The fix is to introduce locks and enforce that all queue operations are done while holding the lock.

Let's write a more faithful model: each thread cycles through states: `idle`, `waitingForLock`, `hasLock`, `releasing`. We'll use `pc` for each thread.

Given length constraints, I'll skip the full lock model details. The key point: with locks, the mutual exclusion prevents the race we saw, and the invariant holds.

## 7. Refining the Model to Match the Real Implementation

The bug we encountered was not due to lack of locking per se, but because the implementation used a nonblocking technique (CAS) and the drain operation was not properly synchronized with enqueue/dequeue. Refining our model to include a compareandswap (CAS) on the head pointer would be the next step. But that requires modeling memory consistency (e.g., the ABA problem). We can model a simplified CAS as a conditional action.

However, the goal of this post is not to recreate the exact bug but to demonstrate how formal verification helps catch it. Our nonatomic readwrite model already catches a similar race. The real production race was a checkthenact on `head == NULL`  after checking, another thread drained, then the original thread proceeded to write `head` thinking it was NULL, causing a null pointer assignment. Our model with `ConsumerRead` and `ConsumerWrite` captures that pattern.

## 8. Practical Tips for Modeling

- **Start simple**: model only the essential aspects. Add complexity gradually.
- **Use symmetry**: TLC can exploit symmetry sets to reduce state space (e.g., producers are interchangeable).
- **Bound your model**: Use small constant values (2 producers, 2 consumers, 3 tasks). If the property holds for these bounds, it often holds for larger ones (though not guaranteed). Formal verification is not a proof for unbounded, but for many practical systems, it's convincing.
- **Check both safety and liveness with fairness**: Use `FAIRNESS` assumptions in the model.
- **Iterate until no counterexample**: Each counterexample teaches you about a possible bug.

## 9. From TLA⁺ to Production Code

Once the TLA⁺ model is verified, you can derive a correct implementation. The specification acts as a blueprint. For the work queue, the correct implementation would use a lock or a fine‑grained lock‑free algorithm (like Michael‑Scott). The TLA⁺ model can be used to guide the implementation and to write tests: the counterexamples from TLC become test cases.

For example, the trace we found (old snapshot causing lost tasks) can be turned into a unit test with specific thread interleavings using a mock scheduler.

## 10. Conclusion

Going back to that 3:47 AM wake‑up call: after we fixed the single `synchronized` block, we could have used formal verification to check that no other races existed. We didn't, and we were lucky that the bug never resurfaced. But now, with TLA, we can sleep better. Formal verification doesn't replace testing; it complements it by providing exhaustive analysis of concurrency. It forces you to think precisely about every possible interleaving, and the model checker gives you concrete counterexamples when something is wrong.

In this post, we built a TLA⁺ model of a concurrent work queue, revealed a subtle race condition through the model checker, and discussed how to fix it. I encourage you to try TLA⁺ on your own concurrent systems. The learning curve is steep, but the payoff is immense: the confidence that your system will not fail under the most unexpected interleaving.

Next time you write a concurrent queue, ask yourself: “Can I prove it correct?” With TLA⁺, you can.

---

*Further reading: Leslie Lamport’s “Specifying Systems” (free online). The TLA⁺ toolbox.*