Branch Prediction and Speculative Execution: How Modern CPUs Gamble on the Future

2021-08-15 · Leonardo Benicio

Explore how modern processors predict branch outcomes and execute instructions speculatively, the algorithms behind branch predictors, the performance implications for your code, and the security vulnerabilities like Spectre that emerged from these optimizations.

Modern CPUs are marvels of prediction. Every time your code branches—every if statement, every loop iteration, every function call—the processor makes a bet on what happens next. Get it right, and execution flows at full speed. Get it wrong, and the pipeline stalls while work is thrown away. Understanding branch prediction transforms how you think about code performance. This post explores the algorithms, trade-offs, and real-world implications of one of computing’s most important optimizations.

1. Why Prediction Matters

Consider a simple loop:

for (int i = 0; i < 1000000; i++) {
    sum += array[i];
}

At the machine level, this becomes a branch instruction that checks i < 1000000. A modern CPU might take 15-20 cycles to determine the branch outcome (waiting for the comparison to complete). With a 3 GHz processor, that’s 5-7 nanoseconds per iteration—just for the branch.

But the processor doesn’t wait. It predicts the branch will be taken (loop continues) and keeps fetching instructions. One million correct predictions mean the branch overhead is nearly eliminated. The final iteration mispredicts (loop exits), costing ~15 cycles—a trivial price for eliminating millions of stalls.

1.1 The Pipeline Problem

Modern CPUs use deep pipelines to maximize throughput:

Instruction Pipeline (simplified):
┌───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ Fetch │Decode │Rename │ Issue │Execute│Memory │Retire │
└───────┴───────┴───────┴───────┴───────┴───────┴───────┘
    │       │       │       │       │       │       │
Cycle 1     2       3       4       5       6       7

While instruction N is executing, the CPU is already fetching N+1, N+2, N+3, and so on. But what if N is a branch? The CPU doesn’t know which instructions come next until the branch resolves.

Without prediction: Stop fetching until the branch resolves. The pipeline drains, wasting cycles equal to the pipeline depth (15-20+ stages on modern CPUs).

With prediction: Guess the branch outcome, keep fetching instructions from the predicted path. If correct, no cycles lost. If wrong, flush the speculative work and restart.

1.2 The Cost of Misprediction

When a branch mispredicts:

  1. All speculatively executed instructions are discarded
  2. The pipeline is flushed
  3. Fetch restarts from the correct path
  4. The pipeline must refill before useful work resumes

Misprediction penalty: 10-25 cycles on modern CPUs, depending on how deep the speculation went.

A branch that mispredicts 10% of the time with a 20-cycle penalty:

  • Average cost: 0.10 × 20 = 2 cycles per branch
  • For a tight loop with one branch per iteration, this is significant!

2. Static Branch Prediction

Early processors used simple, static rules.

2.1 Backward Taken, Forward Not Taken (BTFNT)

The simplest heuristic:

  • Backward branches (target address < current address): Predict taken. These are usually loops.
  • Forward branches (target address > current address): Predict not taken. These are usually early exits.

// Backward branch - usually loops, predict taken
loop:
    ...
    jnz loop    // Predict: taken

// Forward branch - usually ifs, predict not taken
    test eax, eax
    jz skip     // Predict: not taken
    ...
skip:

BTFNT achieves ~65% accuracy on typical code—better than random, but far from ideal.

2.2 Compiler Hints

Some ISAs allow the compiler to provide hints:

// GCC built-in for branch hints
if (__builtin_expect(error_condition, 0)) {
    // Unlikely path
    handle_error();
}

// Linux kernel macros
if (likely(condition)) { ... }
if (unlikely(condition)) { ... }

These hints can influence:

  • Static prediction (on CPUs that use hints)
  • Code layout (likely path falls through, unlikely path jumps)
  • Instruction scheduling around the branch

2.3 Limitations of Static Prediction

Static prediction can’t adapt to:

  • Runtime data patterns
  • Phase behavior (different behavior at different times)
  • Input-dependent branches

Modern CPUs use dynamic prediction that learns from runtime behavior.

3. Dynamic Branch Prediction

Dynamic predictors observe branch behavior and learn patterns.

3.1 One-Bit Predictor

The simplest dynamic predictor: remember the last outcome.

State: Last outcome (Taken or Not Taken)
Prediction: Same as last outcome

Branch history:  T T T T T T T N T T T T
Predictions:     ? T T T T T T T N T T T
Correct:         - ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓

Problem: A loop that executes N times will mispredict twice per invocation:

  • First iteration: May mispredict if loop wasn’t taken last time
  • Last iteration: Always mispredicts (predicts taken, but loop exits)

For a loop executed millions of times, two mispredictions per invocation is fine. For a loop executed 5 times inside an outer loop of millions, that’s 2 million mispredictions!
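
A one-bit predictor is small enough to simulate directly. Here is a minimal C++ sketch; the 4096-entry table and modulo indexing are illustrative choices, not any particular CPU's:

// One-bit predictor: a single "last outcome" bit per table entry.
#include <bitset>
#include <cstdint>

struct OneBitPredictor {
    std::bitset<4096> last_outcome;  // 1 = taken, 0 = not taken

    bool predict(uint64_t pc) const {
        return last_outcome[pc % 4096];   // predict the previous outcome
    }
    void update(uint64_t pc, bool taken) {
        last_outcome[pc % 4096] = taken;  // remember the new outcome
    }
};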

3.2 Two-Bit Saturating Counter

Add hysteresis: don’t change prediction on a single wrong outcome.

States:
  00: Strongly Not Taken (predict NT)
  01: Weakly Not Taken (predict NT)
  10: Weakly Taken (predict T)
  11: Strongly Taken (predict T)

Transitions:
  On Taken outcome: Increment (max 11)
  On Not Taken outcome: Decrement (min 00)

State diagram:

   ┌─────────┐  T──►  ┌─────────┐  T──►  ┌─────────┐  T──►  ┌─────────┐
   │   00    │        │   01    │        │   10    │        │   11    │
   │Strong NT│  ◄──N  │ Weak NT │  ◄──N  │ Weak T  │  ◄──N  │Strong T │
   └─────────┘        └─────────┘        └─────────┘        └─────────┘

(T = Taken moves right, N = Not Taken moves left; 00 and 11 saturate.)

Benefit: A single anomaly doesn’t flip the prediction. The inner loop problem is greatly reduced:

  • Loop entry: May be weak, but strong after first execution
  • Loop exit: Mispredicts, but stays “weakly taken” for next invocation
  • Next invocation: Correctly predicts taken on first iteration!
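
Simulating this takes only a saturating increment and decrement on top of the one-bit version. A sketch, again with illustrative sizing:

// Two-bit saturating counters: 0..3 map to Strong NT, Weak NT,
// Weak T, Strong T.
#include <array>
#include <cstdint>

struct TwoBitPredictor {
    std::array<uint8_t, 4096> counter{};  // all entries start at Strong NT

    bool predict(uint64_t pc) const {
        return counter[pc % 4096] >= 2;   // 10 or 11: predict taken
    }
    void update(uint64_t pc, bool taken) {
        uint8_t& c = counter[pc % 4096];
        if (taken && c < 3) c++;          // saturate at Strong Taken
        if (!taken && c > 0) c--;         // saturate at Strong Not Taken
    }
};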

3.3 Branch Target Buffer (BTB)

Prediction isn’t just about direction (taken/not taken). For taken branches, we need the target address.

Branch Target Buffer: A cache mapping branch addresses to target addresses.

┌─────────────────┬────────────────────┬─────────────────┐
│  Branch PC Tag  │  Target Address    │  Prediction     │
├─────────────────┼────────────────────┼─────────────────┤
│  0x4000_1234    │  0x4000_5678       │  11 (Strong T)  │
│  0x4000_2000    │  0x4000_2100       │  01 (Weak NT)   │
│  ...            │  ...               │  ...            │
└─────────────────┴────────────────────┴─────────────────┘

When a branch is fetched:

  1. Look up branch PC in BTB
  2. If hit: Use stored target and prediction
  3. If miss: Use static prediction, compute target when branch executes

BTB entries are limited, so not all branches can be tracked. Working set size matters!
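
As a model, the BTB behaves like a small direct-mapped cache keyed by the branch PC. A sketch under that assumption (entry count and tag scheme are illustrative):

// Direct-mapped Branch Target Buffer (illustrative sizes).
#include <cstdint>

struct BTBEntry {
    uint64_t tag     = 0;      // upper PC bits, to detect aliasing
    uint64_t target  = 0;      // predicted target address
    uint8_t  counter = 2;      // 2-bit direction counter (Weak T)
    bool     valid   = false;
};

struct BTB {
    BTBEntry entries[512];

    // Hit: use the stored target. Miss: fall back to static prediction
    // and fill the entry when the branch executes.
    bool lookup(uint64_t pc, uint64_t* target) const {
        const BTBEntry& e = entries[pc % 512];
        if (e.valid && e.tag == pc / 512) { *target = e.target; return true; }
        return false;
    }
    void insert(uint64_t pc, uint64_t target) {
        entries[pc % 512] = {pc / 512, target, 2, true};
    }
};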

4. Correlating Predictors

Some branches correlate with other branches or with global execution history.

4.1 Local History

A branch may have a pattern based on its own history:

for (int i = 0; i < n; i++) {
    if (i % 2 == 0) {
        even_work();
    }
}

This branch alternates: T, N, T, N, T, N…

A local history predictor tracks recent outcomes for each branch:

Local History Table:
┌─────────────────┬────────────────┬──────────────┐
│  Branch PC      │  History (4b)  │  Prediction  │
├─────────────────┼────────────────┼──────────────┤
│  0x4000_1234    │  1010          │  Pattern: alt│
│  0x4000_2000    │  1111          │  Pattern: T  │
└─────────────────┴────────────────┴──────────────┘

The history bits index into a Pattern History Table (PHT) of 2-bit counters:

Pattern History Table for branch 0x4000_1234:
History │ Counter │ Prediction
────────┼─────────┼───────────
  0000  │   01    │    NT
  0001  │   11    │    T
  0010  │   10    │    T
  ...   │   ...   │   ...
  1010  │   01    │    NT  ← Current history predicts NT

This captures repeating patterns within a single branch.
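
A compact way to model this is PAg-style: a per-branch history register indexing one shared PHT of two-bit counters. A sketch using the 4-bit histories from the table above (table sizes illustrative):

// Two-level local predictor: per-branch 4-bit history selects a
// 2-bit counter, so a pattern like T,N,T,N is learned exactly.
#include <array>
#include <cstdint>

struct LocalPredictor {
    std::array<uint8_t, 1024> history{};  // 4-bit history per branch slot
    std::array<uint8_t, 16> pht{};        // one 2-bit counter per pattern

    bool predict(uint64_t pc) const {
        return pht[history[pc % 1024]] >= 2;
    }
    void update(uint64_t pc, bool taken) {
        uint8_t& h = history[pc % 1024];
        uint8_t& c = pht[h];
        if (taken && c < 3) c++;
        if (!taken && c > 0) c--;
        h = ((h << 1) | taken) & 0xF;     // shift in the new outcome
    }
};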

4.2 Global History

Branches often correlate with each other:

if (x < 0) {
    // ...
}
// Later:
if (x < 0) {  // Same condition - perfectly correlated!
    // ...
}

Or more subtly:

if (ptr != NULL) {
    if (ptr->valid) {  // Very likely taken if first branch was taken
        // ...
    }
}

Global History Register (GHR): A shift register tracking the outcomes of recent branches (not just this branch).

GHR after sequence T, N, T, T, N, T:
┌───┬───┬───┬───┬───┬───┐
│ T │ N │ T │ T │ N │ T │
└───┴───┴───┴───┴───┴───┘
 oldest            newest

The GHR indexes into a global Pattern History Table:

Prediction for current branch = PHT[hash(BranchPC, GHR)]

4.3 gshare Predictor

The gshare predictor XORs the branch PC with the GHR to index the PHT:

           Branch PC
               │
               ▼
          ┌─────────┐
          │   XOR   │◄──── Global History Register
          └────┬────┘
               │ index
               ▼
     ┌─────────────────────┐
     │  Pattern History    │
     │  Table (2-bit       │
     │  counters)          │
     └──────────┬──────────┘
                ▼
            Prediction

gshare captures both:

  • Branch-specific patterns (from PC bits)
  • Global correlation (from GHR bits)

It’s simple, effective, and widely used as a baseline.
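
In code, gshare is a thin layer over the two-bit counter table. A sketch with an illustrative 12-bit history and 4096-entry PHT:

// gshare: XOR the branch PC with the global history register (GHR)
// to index a shared table of 2-bit counters.
#include <array>
#include <cstddef>
#include <cstdint>

struct Gshare {
    std::array<uint8_t, 4096> pht{};  // 2-bit counters
    uint16_t ghr = 0;                 // 12 bits of global history

    size_t index(uint64_t pc) const { return (pc ^ ghr) % 4096; }

    bool predict(uint64_t pc) const { return pht[index(pc)] >= 2; }

    void update(uint64_t pc, bool taken) {
        uint8_t& c = pht[index(pc)];
        if (taken && c < 3) c++;
        if (!taken && c > 0) c--;
        ghr = ((ghr << 1) | taken) & 0xFFF;  // shift outcome into the GHR
    }
};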

4.4 Tournament Predictors

Different predictors excel at different branch types. A tournament predictor uses multiple predictors and learns which is best for each branch:

                    ┌──────────────┐
         ┌─────────►│ Local Pred.  │─────────┐
         │          └──────────────┘         │
         │                                   │
         │          ┌──────────────┐         ▼
Branch ──┼─────────►│ Global Pred. │────────►MUX ──► Prediction
         │          └──────────────┘         ▲
         │                                   │
         │          ┌──────────────┐         │
         └─────────►│   Chooser    │─────────┘
                    └──────────────┘

The chooser table tracks which predictor was more accurate for each branch (or branch pattern). Alpha 21264 used this design, achieving ~95% accuracy.

5. Modern Branch Predictors

Contemporary CPUs use highly sophisticated predictors with multiple levels.

5.1 TAGE: Tagged Geometric History Length

TAGE (TAgged GEometric) is the dominant predictor design in modern CPUs.

Key insight: Different branches need different history lengths. Loop counters need short history. Complex control flow needs long history.

TAGE uses multiple tables with geometrically increasing history lengths:

                      Tagged Tables
             ┌──────┬──────┬──────┬──────┐
             │ T1   │ T2   │ T3   │ T4   │
             │ 4-bit│ 8-bit│16-bit│32-bit│
             │ hist │ hist │ hist │ hist │
             └──────┴──────┴──────┴──────┘
                │      │      │      │
                ▼      ▼      ▼      ▼
             ┌───────────────────────────┐
             │       Provider Select     │
             │   (longest match wins)    │
             └───────────────────────────┘
                     Prediction

Each table entry has:

  • Tag (partial PC + history hash)
  • Prediction counter (2-3 bits)
  • Useful counter (for replacement)

The table with the longest matching history provides the prediction. TAGE achieves ~97% accuracy on typical workloads.

5.2 Perceptron Predictors

Neural-inspired predictors learn weights for history bits:

Prediction = sign(w₀ + Σ(wᵢ × hᵢ))

Where:
  w₀ = bias weight
  wᵢ = weight for history bit i
  hᵢ = history bit i (+1 for taken, -1 for not taken)

Training is simple: if mispredicted, adjust weights toward the correct outcome.

Perceptrons can capture complex correlations that table-based predictors miss. AMD’s Zen architecture uses perceptron-based predictors.
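
The whole mechanism fits in a short sketch. The history length and training threshold below are illustrative, not any shipping core's parameters:

// Perceptron predictor: prediction is the sign of a dot product of
// recent outcomes (+1/-1) with learned weights.
#include <array>
#include <cstdlib>

struct Perceptron {
    static constexpr int H = 16;  // history length (illustrative)
    std::array<int, H + 1> w{};   // w[0] is the bias weight
    std::array<int, H> hist{};    // +1 = taken, -1 = not taken

    int output() const {
        int y = w[0];
        for (int i = 0; i < H; i++) y += w[i + 1] * hist[i];
        return y;
    }
    bool predict() const { return output() >= 0; }

    void update(bool taken) {
        int t = taken ? 1 : -1;
        int y = output();
        // Train on a misprediction or a low-confidence prediction.
        if ((y >= 0) != taken || std::abs(y) < 32) {
            w[0] += t;
            for (int i = 0; i < H; i++) w[i + 1] += t * hist[i];
        }
        // Shift the new outcome into the history.
        for (int i = H - 1; i > 0; i--) hist[i] = hist[i - 1];
        hist[0] = t;
    }
};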

5.3 Loop Predictors

Loops have predictable iteration counts. Dedicated loop predictors detect and track loops:

Loop Predictor Entry:
┌─────────────┬───────────┬───────────┬──────────┐
│  Branch PC  │  Limit    │  Count    │ Confident│
├─────────────┼───────────┼───────────┼──────────┤
│ 0x4000_1234 │    100    │    57     │   Yes    │
└─────────────┴───────────┴───────────┴──────────┘

Prediction: Taken until Count reaches Limit

When a loop is detected (repeated back-edge), the predictor:

  1. Learns the iteration count
  2. Predicts taken until count reached
  3. Predicts not taken on final iteration

This eliminates the “last iteration” misprediction that plagues other predictors.
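
A single loop-predictor entry can be modeled as a counter checked against a learned trip count. A sketch, assuming the simple pattern of a back-edge taken limit times and then not taken once per invocation:

// One loop-predictor entry (illustrative). Confidence requires seeing
// the same trip count twice in a row.
#include <cstdint>

struct LoopEntry {
    uint32_t limit = 0;      // learned trip count
    uint32_t count = 0;      // taken outcomes seen this invocation
    bool confident = false;

    bool predict() const {
        // Until confident, fall back to "always taken".
        return !confident || count < limit;
    }
    void update(bool taken) {
        if (taken) { count++; return; }
        // Loop exited: did the trip count repeat?
        confident = (count == limit);
        limit = count;
        count = 0;
    }
};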

5.4 Return Address Stack

Function returns are indirect branches (target varies). But they follow a pattern: return to the instruction after the call.

Return Address Stack (RAS): A small stack that tracks call sites.

Call site 0x1000 ──► Push 0x1004
Call site 0x2000 ──► Push 0x2004
Call site 0x3000 ──► Push 0x3004

Return ──► Pop 0x3004 (predict return to 0x3004)
Return ──► Pop 0x2004 (predict return to 0x2004)
Return ──► Pop 0x1004 (predict return to 0x1004)

RAS handles returns with near-perfect accuracy for normal call/return patterns. Problems arise with:

  • Exceptions (unwind stack without returns)
  • Tail calls (return address isn’t pushed)
  • Speculation (speculative calls corrupt RAS)

Modern CPUs use techniques like checkpointing to recover RAS on misprediction.
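
The RAS itself is just a small circular stack: push on call, pop on return. A sketch with 16 entries (a size in the range real cores use), where overflow silently overwrites the oldest entries:

// Return Address Stack (illustrative).
#include <cstdint>

struct ReturnAddressStack {
    uint64_t slot[16] = {};
    unsigned top = 0;  // unsigned wraparound keeps the index valid

    void on_call(uint64_t return_addr) {
        slot[top % 16] = return_addr;  // address of instruction after call
        top++;
    }
    uint64_t predict_return() {
        top--;                         // pop; underflow just wraps
        return slot[top % 16];
    }
};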

6. Indirect Branch Prediction

Most branches have a fixed target (direct branches). But some have variable targets:

// Virtual function call
obj->method();  // Target depends on object's vtable

// Switch statement (jump table)
switch (x) { ... }  // Target depends on x

// Function pointer
callback(data);  // Target is the function pointer value

6.1 Indirect Target Array (ITA)

Simple approach: cache recently seen targets.

┌─────────────────┬────────────────────┐
│  Branch PC      │  Recent Targets    │
├─────────────────┼────────────────────┤
│  0x4000_1234    │  0x5000, 0x6000    │
│  0x4000_2000    │  0x7000            │
└─────────────────┴────────────────────┘

Predict the most recently seen target. Works well for monomorphic call sites (one target) but poorly for polymorphic calls.

6.2 Indirect Target Predictor with History

Like branch direction, indirect targets can correlate with history:

switch (state) {
    case A: next_state = B; break;
    case B: next_state = C; break;
    case C: next_state = A; break;
}
// Repeating pattern: A→B→C→A→B→C...

Modern indirect predictors use history to predict targets:

Prediction = TargetTable[hash(BranchPC, GHR)]

6.3 Virtual Call Optimization

Virtual calls are common in OOP code. Techniques to help prediction:

Devirtualization: Compiler converts virtual calls to direct calls when the type is known.

Polymorphic inline caches: At runtime, cache recent receiver types and inline the predicted path.

// Polymorphic inline cache (pseudocode)
if (obj->type == cached_type) {
    cached_method(obj);  // Fast path, direct call
} else {
    obj->vtable[method_index](obj);  // Slow path, indirect call
    cached_type = obj->type;         // Update cache for next time
    cached_method = obj->vtable[method_index];
}

7. Speculative Execution

Branch prediction enables speculation—executing instructions before knowing if they should execute.

7.1 How Speculation Works

Time ──────────────────────────────────────────────►

Instruction stream: A, B, BRANCH, C, D, E...

Pipeline:
Cycle 1: Fetch A
Cycle 2: Fetch B, Decode A
Cycle 3: Fetch BRANCH, Decode B, Execute A
Cycle 4: Fetch C (speculative!), Decode BRANCH, Execute B
Cycle 5: Fetch D (speculative!), Decode C, Execute BRANCH
         └─► Branch resolves: prediction was CORRECT
Cycle 6: Fetch E, Decode D, Execute C  (all valid!)

If prediction was WRONG:
Cycle 5: Branch resolves wrong
         Flush C, D from pipeline
         Restart fetch from correct path

7.2 Speculative State Management

Speculative instructions can’t modify permanent state until the branch is confirmed. CPUs use:

Reorder Buffer (ROB): Instructions complete out-of-order but retire (commit) in order. Speculative instructions wait in the ROB until the branch commits.

Physical Register File: Results are written to physical registers. Architectural registers are updated only on retirement.

Store Buffer: Stores wait in a buffer until retirement, then write to cache.

7.3 Memory Ordering and Speculation

Speculative loads are tricky:

if (x < array_size) {      // Mispredicted as taken
    value = array[x];       // Speculative load (x might be out of bounds!)
}

The load executes speculatively, potentially accessing invalid memory. Modern CPUs:

  • Allow speculative loads (for performance)
  • Check bounds when the load retires
  • Squash if the load was invalid

But the load still affects microarchitectural state (caches, TLBs), leading to security issues…

8. Security: Spectre and Friends

Speculative execution vulnerabilities shocked the industry in 2018. They exploit the side effects of mispredicted speculation.

8.1 Spectre Variant 1: Bounds Check Bypass

if (x < array1_size) {           // Can be mispredicted!
    y = array2[array1[x] * 256]; // Speculatively executed with bad x
}

Attack:

  1. Train predictor to expect branch taken
  2. Call with x out of bounds
  3. Speculative execution reads array1[x] (secret data)
  4. Uses secret as index into array2, loading a cache line
  5. Branch misprediction detected, execution rolled back
  6. But: Cache state persists! The loaded cache line reveals the secret

This leaks arbitrary memory through cache timing side channels.

8.2 Spectre Variant 2: Branch Target Injection

Indirect branch prediction can be poisoned:

  1. Attacker trains BTB with malicious target
  2. Victim process executes indirect branch
  3. Speculation jumps to attacker-chosen “gadget”
  4. Gadget leaks secrets via cache side channel
  5. Speculation rolled back, but cache side effects remain

This allows cross-process and cross-privilege attacks.

8.3 Mitigations

Software mitigations:

// Speculation barrier
if (x < array_size) {
    __asm__ volatile("lfence");  // Block speculation
    y = array[x];
}

// Array index masking
x &= ~(-(x >= array_size));  // Branchlessly zero x if it is out of range

Hardware mitigations:

  • IBRS/IBPB: Indirect Branch Restricted/Prediction Barrier
  • STIBP: Single Thread Indirect Branch Predictor
  • Enhanced IBRS: Hardware mode that isolates prediction
  • Microcode updates: Various fixes for specific variants

Performance impact: Mitigations can cost 2-30% performance depending on workload and mitigation level.

8.4 The Fundamental Problem

Speculation side channels exist because:

  1. Prediction affects what instructions execute (even speculatively)
  2. Speculative execution affects microarchitectural state
  3. Microarchitectural state is observable through timing

Fixing this fundamentally would require either:

  • Never speculate (massive performance loss)
  • Isolate all microarchitectural state (extremely difficult)
  • Eliminate timing side channels (practically impossible)

Modern CPUs continue to balance security and performance with targeted mitigations.

9. Writing Branch-Prediction-Friendly Code

Understanding prediction helps you write faster code.

9.1 Make Branches Predictable

Sort data when possible:

// Unpredictable: random true/false
for (int i = 0; i < n; i++) {
    if (data[i] >= 128) {  // ~50% taken if random
        sum += data[i];
    }
}

// Better: sort first, then all false followed by all true
std::sort(data, data + n);
for (int i = 0; i < n; i++) {
    if (data[i] >= 128) {  // First N iterations false, rest true
        sum += data[i];
    }
}

The sorted version can be 3-5x faster due to better prediction!

9.2 Use Conditional Moves

Replace branches with conditional moves when possible:

// Branchy version
if (a > b) {
    max = a;
} else {
    max = b;
}

// Branchless version (compiler may generate CMOV)
max = (a > b) ? a : b;

// Explicitly branchless
max = b ^ ((a ^ b) & -(a > b));

Conditional moves have fixed latency (~1-2 cycles) regardless of data patterns. Useful when:

  • Branch is unpredictable (near 50/50)
  • Both paths are simple (cheap to compute both)

Warning: Don’t use branchless code blindly! If the branch is predictable, branches are faster because they can speculate ahead.

9.3 Loop Unrolling

Reduce branch overhead by processing multiple elements per iteration:

// Original: 1 branch per element
for (int i = 0; i < n; i++) {
    sum += a[i];
}

// Unrolled: 1 branch per 4 elements (if n isn't a multiple of 4,
// handle the remaining elements in a separate cleanup loop)
for (int i = 0; i + 3 < n; i += 4) {
    sum += a[i] + a[i+1] + a[i+2] + a[i+3];
}

Fewer branches = fewer opportunities for misprediction.

9.4 Profile-Guided Optimization (PGO)

Let the compiler learn branch behavior:

# Step 1: Build instrumented binary
gcc -fprofile-generate -O2 program.c -o program

# Step 2: Run with representative workload
./program typical_input.txt

# Step 3: Rebuild with profile data
gcc -fprofile-use -O2 program.c -o program

PGO enables:

  • Better branch prediction hints
  • Hot path optimization
  • Cold path outlining
  • Better inlining decisions

PGO can improve performance by 10-30% for branch-heavy code.

9.5 Avoid Unpredictable Indirect Branches

Virtual calls and function pointers are indirect branches:

// Hard to predict: polymorphic
for (Shape* s : shapes) {
    s->draw();  // Different target each time
}

// Easier to predict: monomorphic or sorted
std::sort(shapes.begin(), shapes.end(),
          [](Shape* a, Shape* b) { return typeid(*a).before(typeid(*b)); });
for (Shape* s : shapes) {
    s->draw();  // Targets cluster together
}

10. Measuring Branch Performance

10.1 Performance Counters

Modern CPUs provide hardware counters for branch events:

# Linux perf
perf stat -e branches,branch-misses ./program

# Example output:
#  1,234,567,890  branches
#     12,345,678  branch-misses  # 1.00% of all branches

Key metrics:

  • Branch misprediction rate: Aim for < 2% on hot paths
  • Instructions per branch: Lower means more control flow
  • Branch MPKI: Mispredictions per kilo-instruction (per thousand instructions)

10.2 Microbenchmarking

Isolate branch prediction effects:

// Predictable pattern
for (int i = 0; i < N; i++) {
    if (i % 2 == 0) sum++;  // Alternating: TNTNTN...
}

// Unpredictable pattern  
for (int i = 0; i < N; i++) {
    if (rand() % 2 == 0) sum++;  // Random: ???
}

Compare performance to quantify prediction impact.
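
To turn this into a real measurement, generate the random outcomes outside the timed region (so rand() itself isn't measured) and keep the compiler from deleting the loop. A minimal harness sketch:

// Microbenchmark sketch: predictable vs. unpredictable branch.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const int N = 10000000;
    std::vector<int> alternating(N), random_bits(N);
    for (int i = 0; i < N; i++) {
        alternating[i] = i % 2;            // pattern predictors learn this
        random_bits[i] = std::rand() % 2;  // ~50% misprediction rate
    }

    for (const auto* data : {&alternating, &random_bits}) {
        volatile long sum = 0;  // volatile: keep the loop alive
        auto t0 = std::chrono::steady_clock::now();
        for (int x : *data)
            if (x) sum = sum + 1;
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%8.1f ms\n",
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
}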

10.3 CPU-Specific Analysis

Intel VTune and AMD uProf provide detailed branch analysis:

  • Per-branch misprediction rates
  • BTB hit/miss rates
  • Indirect branch target patterns
  • Speculation efficiency

11. Branch Prediction Across Architectures

11.1 x86 (Intel/AMD)

Intel Haswell and later use TAGE-like predictors with:

  • Multiple prediction tables
  • Loop predictors
  • Return address stack (16-32 entries)
  • Indirect target predictors

AMD Zen uses perceptron-based predictors with:

  • TAGE backup
  • Large history lengths
  • Sophisticated indirect prediction

Both achieve ~97% accuracy on typical workloads.

11.2 ARM

ARM cores vary widely:

  • Cortex-A55 (efficiency): Simple bimodal predictor, ~85% accuracy
  • Cortex-A78 (performance): TAGE-like, ~95% accuracy
  • Apple M-series: Highly sophisticated, possibly perceptron-based

ARM’s big.LITTLE design means prediction quality varies between cores.

11.3 RISC-V

RISC-V is an ISA, not an implementation. Predictors vary by vendor:

  • SiFive U74: gshare-based, moderate accuracy
  • Alibaba XuanTie: TAGE-based, high accuracy
  • Research implementations: Testing novel predictor designs

11.4 GPU “Prediction”

GPUs handle branches differently:

SIMD Execution:
Thread 0:  if (true)  { A } else { B }
Thread 1:  if (true)  { A } else { B }
Thread 2:  if (false) { A } else { B }
Thread 3:  if (true)  { A } else { B }

Execution:
1. All threads execute A (thread 2 masked)
2. All threads execute B (threads 0,1,3 masked)

GPUs don’t predict; they execute both paths with masking. This is called divergence and is why GPU code should minimize branches.

12. Advanced Topics

12.1 Branch Prediction and SMT

Simultaneous Multithreading (Hyper-Threading) shares prediction resources:

  • BTB entries are shared or partitioned
  • GHR may be per-thread or shared
  • Prediction tables compete for space

Competing threads can cause prediction interference. Some CPUs partition resources for security (prevent cross-thread Spectre attacks).

12.2 Speculative Memory Disambiguation

Modern CPUs speculate on memory dependencies:

store [A], value1
load  [B]  // Does B alias A?

// CPU speculates "no alias", executes load early
// If wrong, replay the load after the store completes

This is called memory disambiguation prediction. Misprediction causes pipeline flushes similar to branch misprediction.

12.3 Value Prediction

Why stop at predicting branches? We could predict values:

x = load_from_memory();  // Predict x = 42
y = x + 1;               // Speculatively compute y = 43

Value prediction has been researched for decades but isn’t in mainstream CPUs due to:

  • Complexity of recovery on misprediction
  • Limited accuracy for most values
  • Area/power costs

Some specialized uses exist (stride predictors for addresses).

12.4 Machine Learning Predictors

Research explores ML for prediction:

  • Neural networks for branch prediction
  • Reinforcement learning for predictor training
  • Learned index structures for BTB

Challenge: ML inference must complete in < 1 cycle, limiting model complexity.

13. Historical Perspective

13.1 Early Predictors (1980s)

  • Simple static prediction
  • One-bit dynamic predictors
  • Small BTBs

Accuracy: ~70-80%

13.2 Two-Level Predictors (1990s)

  • Local and global history
  • gshare, gselect
  • Tournament predictors

Accuracy: ~90-95%

13.3 Modern Predictors (2000s-present)

  • TAGE and variants
  • Perceptron predictors
  • Sophisticated loop/return prediction

Accuracy: ~97%+

13.4 The Accuracy Wall

Prediction accuracy has plateaued:

  • Easy branches are already perfect
  • Hard branches are fundamentally unpredictable
  • Diminishing returns from predictor complexity

Future gains likely come from:

  • Reducing misprediction penalty
  • Better speculative execution management
  • Compiler assistance

14. Real-World Case Studies

Understanding branch prediction theory is valuable, but seeing real performance impacts cements the knowledge.

14.1 The Sorted Array Benchmark

The famous Stack Overflow question “Why is processing a sorted array faster than an unsorted array?” demonstrates branch prediction perfectly:

#include <algorithm>
#include <cstdio>
#include <cstdlib>

int main() {
    const int arraySize = 32768;
    int data[arraySize];
    
    // Fill with random values 0-255
    for (int c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;
    
    // Optionally sort
    // std::sort(data, data + arraySize);
    
    long long sum = 0;
    for (int i = 0; i < 100000; ++i) {
        for (int c = 0; c < arraySize; ++c) {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    std::printf("%lld\n", sum);  // use sum so the loops aren't optimized away
}

Results:

  • Unsorted: ~10 seconds
  • Sorted: ~2 seconds

The sorted version is 5x faster because the branch becomes predictable. First half: all not-taken. Second half: all taken.
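
The branch can also be eliminated outright, making the loop's speed independent of ordering. A sketch of the standard branchless rewrite of the inner loop above:

// Branchless inner loop: -(data[c] >= 128) is all ones when the
// condition holds and zero otherwise, so the AND keeps or drops data[c].
for (int c = 0; c < arraySize; ++c) {
    sum += data[c] & -(long long)(data[c] >= 128);
}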

14.2 JSON Parsing Branches

JSON parsers are branch-heavy, checking character types continuously:

while (*p) {
    if (*p == '"') parse_string();
    else if (*p == '{') parse_object();
    else if (*p == '[') parse_array();
    else if (isdigit(*p)) parse_number();
    // ...
}

Optimized parsers like simdjson use:

  • SIMD to classify multiple characters at once
  • Branchless state machines
  • Data-parallel parsing

simdjson achieves 2-4GB/s versus ~200MB/s for traditional parsers, largely by eliminating unpredictable branches.

14.3 Game Engine Physics

Physics engines often process many objects:

for (Entity& e : entities) {
    if (e.hasPhysics) {
        if (e.isAwake) {
            if (e.collisionEnabled) {
                // Process physics
            }
        }
    }
}

Optimizations:

  • Sort entities by type (all physics entities together)
  • Use data-oriented design (separate arrays for different properties)
  • Process in batches of similar entities

Modern engines achieve 10-100x improvements through branch-friendly data organization.

14.4 Database Query Processing

Database queries involve many branches:

// Filter predicate
for (Row& row : table) {
    if (row.age > 30 && row.salary > 50000 && row.dept == "ENG") {
        results.push_back(row);
    }
}

Database optimizations:

  • Vectorized execution (process columns in batches)
  • Predicate reordering (most selective first)
  • Compiled queries (eliminate interpretation overhead)

Vectorized databases like DuckDB, ClickHouse achieve order-of-magnitude improvements.

15. Compiler Optimizations for Branches

Compilers employ sophisticated techniques to improve branch behavior.

15.1 If-Conversion

Convert branches to conditional moves:

// Original
if (condition) {
    x = a;
} else {
    x = b;
}

// If-converted (compiler generates CMOV)
x = condition ? a : b;

Compilers apply if-conversion when:

  • Both paths are simple
  • Branch is likely unpredictable
  • Architecture supports conditional moves

15.2 Branch Probability Propagation

Compilers track branch probabilities through the code:

if (error_check()) {  // Rare: 0.01%
    handle_error();   // Also rare
    return;
}
// Normal path: 99.99%
do_work();

Probabilities inform:

  • Code layout (hot path falls through)
  • Inlining decisions (inline hot paths)
  • Register allocation (optimize for hot path)

15.3 Hot/Cold Splitting

Move unlikely code out of hot paths:

// Before
void process() {
    if (unlikely(error)) {
        // 100 lines of error handling
    }
    // Hot path
}

// After (compiler splits)
void process() {
    if (unlikely(error)) {
        handle_error_cold();  // Outlined to separate function
    }
    // Hot path (better instruction cache)
}

Cold code outlining improves instruction cache utilization for hot paths.

15.4 Loop Unswitching

Hoist loop-invariant conditions:

// Before
for (int i = 0; i < n; i++) {
    if (flag) {  // Loop-invariant!
        work_a(i);
    } else {
        work_b(i);
    }
}

// After unswitching
if (flag) {
    for (int i = 0; i < n; i++) {
        work_a(i);
    }
} else {
    for (int i = 0; i < n; i++) {
        work_b(i);
    }
}

Eliminates n-1 branches, improving prediction and enabling further optimizations.

15.5 Speculative Compilation

JIT compilers can optimize based on runtime behavior:

// HotSpot JVM
if (obj instanceof Dog) {  // 99% Dog
    ((Dog) obj).bark();
}

// JIT generates:
// if (obj.class == Dog.class) {  // Fast path: direct call
//     dog_bark(obj);
// } else {
//     slow_path_instanceof(obj);  // Deoptimize if assumption broken
// }

This “speculative optimization” bets on observed patterns, with fallback for uncommon cases.

16. Branch Prediction in Different Domains

16.1 Embedded Systems

Resource-constrained embedded systems have simpler predictors:

  • Cortex-M series: Static or simple bimodal
  • Smaller BTBs (32-128 entries)
  • Lower misprediction penalties (shorter pipelines)

Embedded developers often:

  • Avoid complex control flow
  • Use lookup tables instead of branches
  • Manually unroll critical loops

16.2 High-Frequency Trading

HFT systems are extremely latency-sensitive:

  • Every nanosecond matters
  • Branch mispredictions are critical path
  • Custom hardware (FPGAs) avoids branches entirely

HFT optimizations:

// Branchless comparison
int cmp = (a > b) - (a < b);  // Returns -1, 0, or 1

// Branchless min (max is symmetric); relies on arithmetic right
// shift of signed values and on a - b not overflowing
int min = b + ((a - b) & ((a - b) >> 31));

// Pre-computed decision tables
action = decision_table[state][event];

16.3 Scientific Computing

Scientific code is often predictable (regular loops over arrays):

! Matrix multiplication - highly predictable
do i = 1, n
    do j = 1, n
        do k = 1, n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
        end do
    end do
end do

But irregular data structures cause problems:

// Sparse matrix - unpredictable access patterns
for (int i = 0; i < nnz; i++) {
    if (col[i] == target_col) {  // Unpredictable!
        sum += val[i];
    }
}

Scientific libraries optimize sparse operations carefully.

16.4 Cryptographic Code

Cryptography requires constant-time execution to prevent timing attacks:

// WRONG: Timing depends on secret key bits
if (secret_key[i]) {
    result = operation_a();
}

// RIGHT: Always execute both, select result
result_a = operation_a();
result_b = operation_b();
mask = -(secret_key[i] & 1);  // All 1s or all 0s
result = (result_a & mask) | (result_b & ~mask);

Constant-time code deliberately avoids prediction-dependent timing.

17. Future Directions

17.1 Learned Predictors

Machine learning for prediction shows promise:

  • Neural network-based BTB indexing
  • Reinforcement learning for table management
  • Transfer learning from similar workloads

Challenges:

  • Inference latency (must be < 1 cycle)
  • Training complexity
  • Generalization to new workloads

17.2 Software-Hardware Co-design

Better compiler-hardware communication:

  • Richer branch hints from compilers
  • Hardware profiling feedback to compilers
  • Adaptive optimization based on runtime behavior

17.3 Security-First Predictors

Post-Spectre designs prioritize security:

  • Isolated prediction domains per security context
  • Speculative execution firewalls
  • Prediction state clearing on context switches

17.4 Heterogeneous Prediction

Different predictors for different branch types:

  • Simple predictor for loop branches
  • Neural predictor for data-dependent branches
  • Table-based for indirect branches

Dynamic selection based on branch characteristics.

18. Summary

Branch prediction is a cornerstone of modern CPU performance:

  • Deep pipelines require prediction to avoid stalls
  • Dynamic predictors learn from runtime behavior
  • Modern designs (TAGE, perceptron) achieve ~97% accuracy
  • Speculation enables out-of-order execution but creates security risks
  • Code patterns dramatically affect prediction accuracy

Key takeaways for developers:

  1. Predictable branches are nearly free (~0 cycles overhead)
  2. Unpredictable branches are expensive (10-25 cycles penalty)
  3. Sorting data can dramatically improve prediction
  4. Branchless code is only faster for unpredictable branches
  5. Profile your code to find branch hotspots
  6. Understand the architecture to write efficient code

The branch predictor is the CPU’s crystal ball, making educated guesses millions of times per second. Understanding how it works transforms how you think about the true cost of a simple if statement.

Every branch in your code is a bet. The CPU is gambling on the future, billions of times per second. Understanding these bets helps you write code that wins more often.