Superscalar Processors: Register Renaming, Reorder Buffers, and How Modern Cores Extract ILP

2023-09-28 · Leonardo Benicio

Summary

A microarchitectural deep dive into superscalar execution: register renaming, the reorder buffer, reservation stations, and the issue queue, examining how Haswell, M1, and Zen4 extract instruction-level parallelism from sequential code.

Contents

A modern high-performance processor core can execute four, six, or even eight instructions per clock cycle, despite the program being written as a sequential stream. This is the miracle of superscalar execution: the processor dynamically extracts instruction-level parallelism (ILP) from a sequential instruction stream by analyzing data dependencies, renaming registers to eliminate false dependencies, and executing independent instructions in parallel across multiple functional units. The result is that a single-threaded program runs as if it had been parallelized—but the “parallelizing compiler” is the processor itself, operating in hardware at nanosecond timescales.

This article is a deep dive into the machinery of superscalar execution: the frontend that fetches and decodes instructions, the register renaming engine that eliminates false dependencies, the reservation stations and issue queues that schedule instructions for out-of-order execution, the reorder buffer that ensures precise exceptions, and the memory disambiguation logic that speculatively reorders loads and stores. We will ground the discussion in three real microarchitectures: Intel’s Haswell (2013, the archetypal modern x86 core), Apple’s M1 Firestorm (2020, the widest commercial ARM core), and AMD’s Zen4 (2022, the highest-clocked x86 core).

1. The Instruction Pipeline: From In-Order to Out-of-Order

The classical five-stage RISC pipeline (fetch, decode, execute, memory, writeback) executes instructions strictly in program order. If instruction (I_1) produces a result needed by instruction (I_2), the pipeline stalls until (I_1) completes. This is the in-order execution model: instructions are issued and completed in program order, and the only parallelism comes from pipelining (overlapping different stages of different instructions).

Superscalar execution breaks this constraint. Instructions are still fetched and decoded in program order, but after decoding, they enter an out-of-order execution engine where they can issue for execution as soon as their operands are available, regardless of program order. A long-latency load that misses the cache does not block independent instructions that come after it in program order. The results of out-of-order execution are then reassembled in program order at the commit stage, preserving the illusion of sequential execution for the programmer.

The key structures that enable this transformation are:

Register renaming: Eliminates write-after-write (WAW) and write-after-read (WAR) false dependencies by mapping architectural registers to a larger pool of physical registers.
Reservation stations / Issue queue: Buffers that hold instructions waiting for their operands, waking up instructions when their operands become available.
Reorder buffer (ROB): A circular buffer that tracks all in-flight instructions in program order, holding their results until they can be committed in order.
Load-store queue (LSQ): Handles memory dependencies, ensuring that loads and stores to the same address are executed in the correct order.

2. Register Renaming: Eliminating False Dependencies

add r1, r2, r3    ; r1 = r2 + r3
mul r4, r1, r5    ; r4 = r1 * r5  (true dependence on r1)
add r1, r6, r7    ; r1 = r6 + r7  (WAW with first add, WAR with mul)
sub r8, r1, r9    ; r8 = r1 - r9  (true dependence on new r1)

There are three data dependencies involving r1:

RAW (read-after-write): The mul reads r1 produced by the first add. This is a true dependence: the mul must wait for the add.
WAW (write-after-write): The second add writes r1, overwriting the first add’s result.
WAR (write-after-read): The second add writes r1 after the mul reads it.

Without renaming, the second add cannot execute until the mul has read r1 (WAR), and the sub must wait for the second add (RAW). All four instructions are serialized.

With renaming, each write to an architectural register is mapped to a different physical register:

First add: allocates physical register p1, writes p1, maps r1 → p1.
mul: reads p1 (the mapping of r1 at this point).
Second add: allocates p2, writes p2, remaps r1 → p2.
sub: reads p2 (the new mapping of r1).

Now the second add and the sub can execute in parallel with the mul, because they use different physical registers. The only serialization is the RAW from the first add to the mul. WAW and WAR dependencies are eliminated.

2.1 The Rename Map Table and Free List

The register rename engine maintains:

A Register Alias Table (RAT) mapping each architectural register to the physical register that holds its most recent value.
A Free List of physical registers available for allocation.
A Retirement RAT (or architectural RAT) that records the mapping state at the current commit point, used to restore the RAT on branch misprediction.

On each instruction that writes a destination register:

Allocate a new physical register from the Free List.
Record the old mapping (the physical register being superseded) for later freeing.
Update the RAT: arch_reg → new_physical_reg.

On commit:

Free the old physical register (the one superseded by the committed instruction) back to the Free List.

On branch misprediction:

Restore the RAT from the Retirement RAT at the mispredicted branch’s checkpoint, effectively discarding all renames performed by instructions after the branch.

2.2 Physical Register File Sizes

The number of physical registers determines how many in-flight instructions can be supported. Haswell has 168 integer physical registers and 168 vector physical registers. Apple M1 Firestorm has roughly 354 integer physical registers and 384 vector physical registers—a massive register file that enables the core to keep hundreds of instructions in flight. Zen4 has 224 integer and 192 vector physical registers. These sizes are chosen based on the expected depth of the ROB (see below) so that physical registers are rarely the bottleneck.

3. The Reorder Buffer: In-Order Commit

The ROB is a circular buffer of entries, each representing one in-flight instruction. Instructions are allocated ROB entries in program order at the rename stage and deallocated in program order at the commit stage. The ROB serves three functions:

Precise exceptions: If an instruction faults (e.g., divide by zero, page fault), the ROB ensures that all instructions before the faulting instruction commit, and all instructions after it are discarded, restoring the precise architectural state at the faulting instruction.
Branch misprediction recovery: When a branch misprediction is detected, all instructions after the branch (identified by their ROB indices) are squashed, and the frontend is redirected to the correct path.
In-order retirement: Results from out-of-order execution are held in the ROB until all preceding instructions have committed, then written to the architectural register file in program order.

The ROB size limits the instruction window—the number of uncommitted instructions from which the processor can find independent work. Haswell’s ROB is 192 entries; M1 Firestorm’s is approximately 630 entries—the largest in any commercial processor, enabling it to extract ILP from extremely long dependency chains. Zen4’s ROB is 320 entries. A larger ROB is generally better for ILP, but it increases the complexity of the wakeup and selection logic (which must scan all ROB entries for ready instructions) and the physical register file pressure.

4. Reservation Stations and the Issue Queue

After renaming, instructions are dispatched to reservation stations (RS) or a unified issue queue (IQ). These are associative buffers where each entry holds an instruction waiting for its operands. Each entry tracks:

The instruction’s opcode and the functional unit it requires.
The values of its source operands (or, if not yet available, the physical register tags that will produce them).
A “ready” bit for each source operand.

When an instruction completes execution and produces a result, its physical register tag is broadcast on the result bus (also called the common data bus or wakeup bus). All reservation stations compare the broadcast tag against their pending source tags. If a match is found, the corresponding operand value is captured and the ready bit is set. When all operands are ready, the instruction becomes ready to issue.

A scheduler (or issue logic) selects ready instructions from the reservation stations and sends them to the appropriate functional units. The selection policy is typically oldest first among ready instructions (prioritizing instructions that have been waiting longest) to minimize the average latency and to reduce the likelihood of starvation.

4.1 Distributed vs. Unified Schedulers

Different microarchitectures make different choices about scheduler organization:

Haswell uses a unified scheduler with 60 entries shared by integer and floating-point instructions. This is simpler but requires the scheduler to be close to all functional units, which creates wire delays.
M1 Firestorm uses distributed reservation stations: each functional unit cluster has its own small scheduler (roughly 10-20 entries). Instructions are dispatched to the appropriate scheduler based on their opcode. This reduces wire delays and enables higher clock frequencies.
Zen4 uses a combination: an integer scheduler (96 entries) and a floating-point scheduler (64 entries), organized as unified queues within each domain but separate between domains.

The choice between unified and distributed scheduling is a fundamental microarchitectural tradeoff: unified schedulers provide better load balancing (any ready instruction can use any functional unit), while distributed schedulers provide better scalability (smaller structures, shorter wires, higher frequency).

5. The Frontend: Feeding the Beast

The out-of-order backend can only execute instructions that the frontend has fetched and decoded. If the frontend cannot supply instructions at the backend’s consumption rate, the extra execution width is wasted. The frontend is thus the critical path for sustained ILP.

5.1 Branch Prediction

The frontend’s effectiveness is gated by branch prediction accuracy. A single mispredicted branch squashes the entire pipeline and causes a bubble of 10-20 cycles (the mispredict penalty). Branch prediction in modern cores uses a hierarchy:

Branch Target Buffer (BTB): Predicts the target address of a taken branch based on the branch’s PC.
Direction predictor: Predicts whether a conditional branch is taken, using correlated predictors (TAGE, perceptron) that track long histories of branch outcomes.
Return Stack Buffer (RSB): Predicts return addresses by mirroring the call stack.
Indirect branch predictor: Predicts the target of indirect jumps (useful for virtual function calls and switch statements).

Haswell’s branch predictor achieves roughly 97-98% accuracy on SPEC CPU benchmarks. M1 Firestorm’s predictor is significantly larger (Apple has invested heavily in prediction accuracy, knowing that their 8-wide decode stage makes mispredict penalties especially large—roughly 15-20 cycles).

5.2 Instruction Fetch and Decode Width

The peak decode width (instructions per cycle decoded) determines the maximum sustained IPC. Haswell decodes 4 x86 instructions per cycle; M1 Firestorm decodes 8 ARM instructions per cycle; Zen4 decodes 4 x86 instructions per cycle but uses an op cache that can deliver up to 8 µops per cycle for hot code regions.

The x86 instruction set’s variable-length encoding makes high decode width challenging. Both Intel and AMD include a µop cache that caches decoded instructions (µops), bypassing the complex x86 decoder for frequently executed code. The µop cache hit rate is critical for frontend throughput; on typical workloads, it exceeds 80%.

6. Memory Disambiguation and the Load-Store Queue

Memory instructions (loads and stores) introduce dependencies that are unknown at rename time because the addresses are not yet computed. The load-store queue (LSQ) tracks all in-flight memory operations and performs memory disambiguation: determining whether a load can safely execute before an older store whose address is unknown.

The LSQ is divided into the load queue (LQ) and store queue (SQ). Stores are allocated SQ entries in program order and commit their data to the cache only at retirement (to support precise exceptions). Loads check the SQ for older stores to the same address; if a match is found, the load must either wait (if the store data is not yet available) or forward the data from the store (if the store data is ready). If no older store matches, the load can execute speculatively, assuming no aliasing.

Speculative loads are the source of Spectre v4 (Speculative Store Bypass), as discussed in the microarchitectural attacks article. The processor predicts that a load does not alias with an older store and executes the load speculatively; if the prediction is wrong, the load receives stale data. Hardware mitigations (SSBD) prevent the speculative forwarding but reduce memory-level parallelism.

7. Putting It All Together: The Life of an Instruction

To make the description concrete, let’s trace the life of an add instruction on Haswell:

Fetch (cycle 0): The instruction is fetched from the L1 instruction cache, using the predicted next PC from the branch predictor.
Pre-decode (cycle 1): The instruction length decoder identifies instruction boundaries in the variable-length x86 byte stream.
Decode (cycle 2): The x86 decoder translates the add into one or more µops—in this case, a single µop.
µop cache fill (cycle 2): The decoded µop is written into the µop cache for future reuse.
Allocate/Rename (cycle 3): The µop is allocated an ROB entry. Its destination register is renamed (a new physical register is allocated). Its source registers are looked up in the RAT to find the physical registers holding their values. If the sources are not yet ready (the producing instructions haven’t completed), the µop records the physical register tags of the producers.
Dispatch (cycle 3): The µop is dispatched to the unified scheduler, which holds it until its operands are ready.
Wait for operands (cycles 3-N): The µop sits in the scheduler, monitoring the result bus for broadcasts of its pending physical register tags. When all tags have been broadcast and the values captured, the µop becomes ready.
Issue (cycle N): The scheduler selects the µop for execution and sends it to an integer ALU.
Execute (cycle N+1): The ALU computes the sum. The result is written to the physical register file and broadcast on the result bus, waking up any µops waiting for this physical register.
Complete (cycle N+1): The µop’s completion status is written to its ROB entry.
Commit (cycle M, where M depends on the state of the ROB): When the µop reaches the head of the ROB and all preceding instructions have committed, the µop retires. Its destination architectural register is updated from the physical register. The old physical register (superseded by this instruction) is returned to the Free List.

The total latency from fetch to commit is at least 4 cycles (if operands are ready at dispatch and the ROB is empty), but typically 10-15 cycles for the instruction to traverse the full pipeline. The key point is that while this instruction is in flight, hundreds of other instructions are also in flight, and the execution units are kept busy by the continuous stream of ready instructions from the scheduler.

8. ILP Limits and the Memory Wall

How much ILP can a superscalar core actually extract? The theoretical limit is bounded by the instruction window size (ROB entries) and the available functional unit parallelism. In practice, the limit is set by data dependencies: every program has chains of dependent instructions (each instruction’s output is the next instruction’s input) that cannot be parallelized.

The critical path through a typical integer program has a dependence chain length of 5-10 instructions per cycle of work, limiting sustained IPC to 2-3 on integer code regardless of core width. Floating-point and vector code, which often has longer dependence chains and more independent operations (SIMD lanes), can sustain higher IPC (4-6 on well-optimized HPC code).

The ultimate limit, however, is the memory wall: cache misses expose the processor to DRAM latency (50-100 ns, or roughly 200-400 cycles at 4 GHz). During a cache miss, the out-of-order engine can execute independent instructions, but the ROB will eventually fill with dependent instructions waiting for the miss to resolve. At that point, the core stalls. Memory-level parallelism (MLP)—the ability to have multiple cache misses in flight simultaneously—is the primary determinant of performance on data-d31dd26 workloads. Haswell supports roughly 10 outstanding L1 cache misses; M1 Firestorm supports many more (estimates suggest 30-50), contributing to its superior performance on pointer-chasing workloads.

9. Precise Exceptions: Maintaining the Illusion of Sequential Execution

One of the most subtle challenges in out-of-order execution is handling exceptions and interrupts while preserving the precise exception model. A precise exception means that when an instruction faults (e.g., a page fault or division by zero), the architectural state visible to the exception handler must appear as if all instructions before the faulting instruction have executed and no instruction after it has executed. This is trivial on an in-order processor (just stop the pipeline) but profoundly difficult on an out-of-order processor where instructions execute in dataflow order and may complete long before older instructions.

The Checkpoint Mechanism. The standard solution is to checkpoint the architectural register state at regular intervals. When an instruction is dispatched, it carries a checkpoint identifier corresponding to the youngest checkpoint that precedes it. If the instruction faults, the processor restores the register map table from that checkpoint and flushes all instructions from the ROB that are younger than the checkpoint. The checkpoint granularity determines the recovery latency: more frequent checkpoints mean faster exception recovery but higher storage overhead. Modern designs typically checkpoint every branch (since branches are natural recovery points) or every N instructions (64-128). Each checkpoint stores a copy of the rename map table (~200 entries * 8 bits = ~200 bytes for a 192-register file), and storing 16-32 checkpoints costs a few kilobytes of SRAM.

The Physical Register File Reclaim Problem. A subtler issue arises with physical register recycling. When an exception triggers a checkpoint restore, the physical registers allocated to instructions after the checkpoint must be freed. The naive approach—scanning the ROB and freeing registers of flushed instructions—is too slow for a cycle-critical recovery path. Instead, the physical register file maintains a free list that supports checkpointed allocation: each checkpoint records which registers were allocated after it, and the restore operation bulk-frees those registers. This is implemented by maintaining a per-checkpoint allocation count and a linked list of allocated registers, or by using a retirement-based free list where registers are freed only when the writing instruction commits, not when it is flushed.

External Interrupts and the ROB Drain. External interrupts (timer interrupts, I/O interrupts) pose a different challenge: they arrive asynchronously and must be serviced promptly, but the processor may have hundreds of instructions in flight. The processor cannot simply stop and service the interrupt because the in-flight instructions have already modified microarchitectural state (cache, TLBs, branch predictors) that would be lost or corrupted. Instead, the processor sets an interrupt pending flag that is checked at commit time. When the oldest instruction in the ROB commits, if the interrupt flag is set, the processor diverts the fetch stage to the interrupt handler and drains the pipeline normally. The interrupt latency is therefore bounded by the ROB drain time—typically 50-200 cycles—which is acceptable for most interrupts but problematic for real-time systems. Real-time processors often use imprecise interrupt modes or limit the ROB size to bound interrupt latency.

The Meltdown Connection. The importance of precise exceptions became dramatically apparent with the Meltdown attack (Section 5 of the Speculative Execution article). Meltdown exploits the fact that on Intel processors prior to Cascade Lake, the permission check for a kernel-memory load occurs in parallel with the data access rather than before it. The result: a load that will eventually fault (because user code accessed kernel memory) speculatively forwards the kernel data to dependent instructions before the fault is recognized. The precise exception model is not violated—the fault eventually triggers and the architectural state is restored to the pre-fault checkpoint—but the microarchitectural state (specifically, the cache) has been modified by the speculatively executed dependent instructions, leaking the kernel data. Meltdown forced a reexamination of the precise exception implementation: the fix (KPTI) ensures that kernel pages are unmapped in user space, so the permission check happens before any data access, eliminating the speculative window entirely.

10. Value Prediction: Breaking True Data Dependencies

While register renaming eliminates false (WAR and WAW) dependencies, true (RAW) dependencies remain the fundamental limit on ILP. If every instruction depends on the result of the previous instruction, no amount of superscalar width or ROB capacity can extract parallelism. Value prediction attacks this limit by guessing the output of an instruction before it executes, allowing dependent instructions to proceed speculatively.

The Last-Value Predictor. The simplest value predictor, proposed by Lipasti and Shen (1996), records the last N values produced by each static instruction and predicts that the next execution will produce the most frequent value. For instructions that produce the same value most of the time (e.g., a branch that almost always computes the same target address, or a load that repeatedly reads the same global variable), the last-value predictor achieves accuracy above 80%. The prediction is verified when the instruction executes; a misprediction triggers a pipeline flush of all dependent instructions, similar to a branch misprediction.

Stride and Context-Based Predictors. Many instructions exhibit value locality with regular patterns. A loop induction variable (i++) produces a predictable sequence of values (1, 2, 3, …). A stride predictor detects this pattern and predicts the next value as last_value + stride. For nested loops, a context-based value predictor (Sazeides and Smith, 1997) uses a history of recent values as context to index into a prediction table, capturing more complex patterns like 1, 1, 2, 1, 1, 2, ....

The Practical Impact. Value prediction has been studied extensively in academia but has seen limited commercial adoption. Research by Perais and Seznec (2014) demonstrated that a 16 KB value predictor coupled with a confidence estimator can improve IPC by 15-25% on SPECint benchmarks while adding approximately 5% to core power consumption—a ratio that makes it attractive for performance-optimized designs but problematic for power-constrained ones. The primary obstacle is the cost: a value predictor requires substantial on-chip storage (the prediction tables), and a value misprediction is more expensive than a branch misprediction (because the mispredicted value may have been consumed by many dependent instructions, all of which must be flushed). The power cost of speculative execution on mispredicted values is also significant—every instruction executed with a wrong operand wastes dynamic energy. Intel’s research prototypes (the “Polymorphic Pipeline” concept) and academic designs have demonstrated 10-30% IPC improvement on integer benchmarks with value prediction, but the power overhead has so far prevented productization.

The Computational Pattern Classification. Value prediction is most effective for specific instruction categories. Constant-producing instructions (e.g., xor %rax, %rax) always produce zero and are trivially predictable. Stride-producing instructions (loop induction variables, address calculations) follow arithmetic progressions. Repeated-value instructions (loading the same global configuration flag across many iterations) exhibit high value locality. The key insight from the research literature is that these predictable instructions constitute 30-50% of all dynamic instructions in typical integer code, meaning a well-designed value predictor can eliminate a significant fraction of true data dependencies. The remaining 50-70% of instructions are genuinely unpredictable and set the hard limit on value prediction’s benefit.

Load Value Prediction in Practice. One form of value prediction that has seen commercial adoption is load value prediction for silent stores—stores that write the same value that is already in memory. On detecting a silent store, the processor can avoid the cache write, saving power and write bandwidth. ARM’s Cortex-A77 and later cores implement a form of silent store detection. More recently, Apple’s M1 and M2 Firestorm cores are rumored (based on reverse-engineering by Asahi Linux and third-party performance analysis) to implement a limited form of load value prediction for pointer-chasing workloads, where loads from linked data structures frequently return the same values due to structural locality. The exact mechanism is not publicly documented, but the performance characteristics suggest a modest-sized predictor that captures the most common value patterns for loads. Independent microbenchmarking by the performance community has demonstrated that Firestorm achieves load-to-use latencies for predictable pointer chains that are 2-3 cycles faster than what naive out-of-order execution would permit.

11. The Power and Area Cost of Superscalar Design: A Quantitative Perspective

Superscalar execution is expensive—not in dollars per chip (transistors are cheap) but in power and design complexity. Understanding the cost breakdown explains why embedded and efficiency cores (ARM Cortex-A5x series, Intel E-cores) are much narrower than performance cores.

Area Breakdown. For a modern out-of-order core like the ARM Cortex-X3 (a 6-wide design), the approximate area distribution is:

Frontend (fetch, decode, branch predictor): ~15% of core area. The branch predictor, with its multi-level TAGE and perceptron components, can occupy 5-10% of core area alone.
Rename and dispatch: ~10%. The rename map table and free list are relatively compact SRAM structures.
ROB and physical register file: ~25%. The ROB (200-300 entries, each tracking 8-16 fields) and the PRF (200-300 entries, each 64-128 bits for integer, 128-512 bits for vector) are the largest on-core structures. The PRF’s multiported design (8-12 read ports, 4-8 write ports) dominates the area cost because each additional port adds roughly 10-15% to the SRAM area.
Scheduler (reservation stations): ~20%. The scheduler is implemented as a content-addressable memory (CAM) that matches producing instructions to consuming instructions. Each entry holds the instruction opcode, two operand tags, two operand values (if ready), and a wakeup tag. The wakeup logic—comparing every completed result tag against every waiting instruction’s operand tags—is a quadratic-cost operation: for an N-entry scheduler, the wakeup requires O(N * issue_width) comparisons per cycle.
Execution units: ~15%. The ALUs, FPUs, and vector units themselves are relatively compact compared to the scheduling infrastructure that feeds them.
Load-store unit (LSQ, L1 caches): ~15%. The L1 data cache (32-64 KB) and the load-store queue (100-200 entries) are the main consumers.

Power Breakdown. The dynamic power consumption tells a different story:

Clock distribution and pipeline registers: ~30% of core power. Every pipeline stage boundary is a set of flip-flops, and clocking them consumes power proportional to the number of stages and the flip-flop count.
Scheduler wakeup and select logic: ~15-20%. The CAM wakeup is particularly power-hungry because it evaluates many tag comparisons in parallel every cycle.
Register file reads and writes: ~15%. The PRF’s multiported SRAM is a major power consumer due to the high number of simultaneous accesses.
Branch predictor: ~5-10%. The large SRAM tables (TAGE, perceptron weights) are accessed every cycle.
Execution units: ~15%. The actual computation (ALU, FPU) is surprisingly power-efficient compared to the scheduling overhead.
Cache accesses: ~10%. L1 cache accesses are power-efficient due to small size and optimized SRAM design.

The Efficiency Core Alternative. This cost structure explains why efficiency cores (Intel Gracemont E-core, ARM Cortex-A510) adopt a fundamentally different design philosophy:

In-order or narrow OoO: Gracemont is a dual-3-wide in-order cluster; Cortex-A510 is a narrow 3-wide OoO with a small ROB (100 entries). This eliminates the scheduler CAM and reduces the PRF port count (3-4 read ports instead of 8-12).
Shared frontend: Multiple E-cores share a single fetch and decode unit, amortizing the branch predictor and instruction cache cost across cores.
Power efficiency: An E-core delivers roughly 40-50% of the single-threaded performance of a P-core at 20-30% of the power, yielding superior performance-per-watt for throughput-oriented workloads.

The superscalar design space is thus a Pareto frontier: wider cores extract more ILP but consume disproportionately more power. This frontier has shifted over time as transistor budgets have grown—a 6-wide core in 2010 consumed an unsustainable fraction of the chip’s power budget; by 2025, the same 6-wide design fits comfortably within the power envelope of a laptop processor. The frontier continues to advance, but the diminishing returns of additional width (Amdahl’s Law applied to ILP) suggest that the practical limit for single-threaded general-purpose code lies somewhere between 8 and 12 wide, beyond which the additional power and area yield negligible performance gains. The future of superscalar performance is therefore not in ever-wider cores but in heterogeneous designs that match the core width to the workload’s available ILP, and in architectural innovations (like value prediction and speculative memory disambiguation) that increase the effective ILP within a given width. The optimal width for a given workload depends on the ILP available in the code and the power budget. Server workloads (databases, web servers) with abundant ILP benefit from wide cores; mobile workloads (UI rendering, media playback) benefit from efficiency cores; and desktop/laptop workloads sit in between, making heterogeneous designs (P-cores + E-cores) the dominant approach in the 2020s.

12. Branch Prediction: TAGE, Perceptron, and the Limits of ILP

No discussion of superscalar execution is complete without a deep treatment of branch prediction—the mechanism that keeps the frontend supplied with instructions across conditional branches. A modern superscalar core with a 512-entry reorder buffer may have 200+ instructions in flight at any moment; a single branch misprediction flushes the pipeline and wastes the work equivalent to 15-20 cycles of execution. Branch prediction accuracy is therefore the single most important determinant of superscalar efficiency.

12.1 The TAGE Predictor: TAgged GEometric History Lengths

The TAGE predictor (Seznec and Michaud, JILP 2006; winner of the Championship Branch Prediction competition for multiple generations) has been the dominant predictor in high-performance cores since Intel’s Sandy Bridge (2011). TAGE organizes predictions into multiple tables, each indexed by a hash of the program counter and a geometric history length—table (T_i) uses the last (L_i = \alpha \cdot \beta^i) branch outcomes (typically (\alpha = 2, \beta = 2)), so table 0 uses 2 bits of history, table 1 uses 4, table 2 uses 8, and so on up to table 8 using 512 bits.

The key insight is that different branches are predictable at different history lengths. A loop branch that alternates every iteration needs only short history; a correlated branch that depends on the outcome of a distant earlier branch needs long history. TAGE automatically selects the appropriate history length by tagging each table entry with a partial tag of the branch’s PC and history, and selecting the prediction from the longest-history table that matches.

A base predictor (a simple bimodal table indexed by PC alone) provides predictions for branches with no history correlation. The total storage budget for a modern TAGE predictor is approximately 32-64 KB, organized as 8-12 tables with geometrically increasing sizes. The prediction accuracy on SPEC CPU 2017 benchmarks averages 97-99% (mispredictions per thousand instructions, MPKI, of 3-10), with the worst-case benchmarks (like gcc and perlbench, which have data-dc14dc3 branches that are fundamentally unpredictable) achieving 92-95%.

12.2 The Perceptron Predictor: Neural-Inspired Branch Prediction

AMD’s Zen microarchitecture uses a perceptron predictor, which is fundamentally different from TAGE. Instead of storing discrete prediction counters, the perceptron predictor stores a vector of weights (w_0, w_1, \ldots, w_n) for each branch. The prediction is computed as:

[ \text{predict} = \text{sign}\left(w0 + \sum{i=1}^{n} w_i \cdot x_i\right) ]

where (x_i \in {-1, 1}) are the features—the outcomes of the last (n) branches (1 for taken, -1 for not taken) and the bits of the branch’s PC. The weights are updated using the perceptron learning rule: if the prediction was correct, no update; if incorrect, (w_i \leftarrow w_i + y \cdot x_i) (where (y) is the true outcome).

The perceptron predictor’s advantage is that it can learn non-linear correlations that TAGE tables cannot represent. For instance, a branch whose outcome is the XOR of two previous branch outcomes is linearly inseparable—a single-layer perceptron cannot predict it, but TAGE cannot represent it either unless it has a table indexed by exactly those two outcomes. In practice, the perceptron predictor achieves comparable accuracy to TAGE on SPEC benchmarks but with a simpler, more regular hardware structure that is easier to implement in high-frequency designs.

Zen 4’s perceptron predictor uses approximately 12 KB of storage for the weight tables (compared to ~40 KB for Intel’s TAGE in Golden Cove), demonstrating that the perceptron approach can achieve competitive accuracy with lower area and power. The tradeoff is that the perceptron requires a dot-product computation (n multiplications and additions) on each prediction, which is more latency-critical than TAGE’s table lookup. AMD amortizes this latency by predicting branches in batches and pipelining the dot-product computation across multiple cycles.

12.3 The Fundamental Limits of Branch Prediction

Not all branches are predictable. A branch whose outcome is a cryptographic hash of program state is fundamentally unpredictable: any predictor, no matter how sophisticated, achieves at most 50% accuracy on such branches because the outcome is computationally indistinguishable from random. The CMOV (conditional move) instruction was introduced specifically to handle such branches without prediction: instead of branching, both values are computed, and the correct one is selected based on the condition. Modern ISAs (x86-64, ARM64) include a rich set of conditional instructions (cmov, csel, cset) that eliminate branches from data-dc14dc3 selection, at the cost of computing both values.

For branches that are predictable, the limit is set by the predictability ceiling of the program: the fraction of branches whose outcomes are deterministic functions of the program’s execution history. The predictability ceiling varies from ~85% for irregular, data-dc14dc3 workloads (like database query processing) to ~99.9% for scientific computing loops with regular iteration patterns. The gap between the predictability ceiling and actual predictor accuracy (typically 95-99%) represents the room for improvement in branch predictor design—a gap that has narrowed from ~10% in 2000 to ~2-3% in 2024, suggesting that we are approaching the practical limits of branch prediction.

13. SMT and Superscalar Efficiency: When Two Threads Share One Core

Simultaneous Multithreading (SMT) is the natural extension of superscalar design to multithreaded workloads. An SMT core issues instructions from multiple hardware threads in the same cycle, filling issue slots that would otherwise go unused due to ILP limitations within a single thread. Understanding SMT’s interaction with superscalar resources is essential for performance analysis of modern server and cloud processors.

13.1 SMT Resource Sharing: Fetch, Issue, Execute, and Commit

In an SMT core (Intel Hyper-Threading, IBM POWER SMT4/SMT8), the frontend alternates fetch cycles between threads (or fetches from multiple threads simultaneously in wider designs). The fetched instructions are tagged with a thread ID and enter the shared rename stage, where they compete for physical registers and reorder buffer entries. The issue queue and reservation stations are shared, with each thread’s instructions competing for issue slots based on operand readiness. The execution units are oblivious to thread ID: a multiplier can execute an instruction from thread A in one cycle and thread B in the next. The commit stage commits instructions in program order per thread, with separate retirement register files (RRFs) logically multiplexed onto the physical register file.

The key resource contention points are:

Fetch bandwidth: If one thread stalls on an I-cache miss, the other thread(s) can consume the full fetch bandwidth, maintaining utilization.
Issue queue entries: Long-latency loads from one thread consume issue queue entries without producing result-ready instructions; the other thread’s instructions can fill these slots.
Physical registers: The physical register file is statically or dynamically partitioned between threads; a thread with high register pressure competes with sibling threads.
Cache capacity: L1 instruction and data caches are shared; threads with large working sets evict each other’s cache lines, potentially reducing overall throughput.

13.2 SMT Performance Gains and Their Saturation

The performance gain from SMT varies dramatically by workload. For workloads with low ILP (e.g., database OLTP, web serving—dominated by cache misses and branch mispredictions), SMT can improve throughput by 30-50% (1.3-1.5x per core over single-thread mode). For workloads with high ILP (e.g., HPC matrix multiplication, video encoding—dominated by long sequences of independent arithmetic), SMT provides minimal gain (5-10%) because the single thread already saturates the execution units. For mixed workloads, SMT provides load balancing: a latency-critical thread gets the resources it needs while a throughput-oriented background thread fills the gaps.

The saturation of SMT gains with additional threads is governed by Amdahl’s law in the microarchitectural domain. With 2 threads (SMT2, standard on Intel and AMD), the gain is typically 1.2-1.5x. With 4 threads (SMT4, IBM POWER9), the gain over SMT2 is an additional 1.1-1.2x, and with 8 threads (SMT8, IBM POWER8), the incremental gain over SMT4 is 1.05-1.1x—diminishing returns as the core’s execution resources become fully saturated. The reason is that the superscalar core has a finite number of execution units, and once the issue slots are fully packed (which happens at 2-4 threads for most workloads), additional threads only increase contention for cache and registers without increasing throughput. The practical limit for general-purpose workloads appears to be SMT4, and most designs have converged on SMT2 as the cost-effective sweet spot.

13.3 SMT Side Channels and Security Implications

SMT’s resource sharing is also a security vulnerability, as discussed in the Spectre/Meltdown article. Two threads sharing a physical core can observe each other’s cache access patterns (via Prime+Probe on shared L1 and L2 caches), branch predictor state (via aliasing in the BTB), and—on some Intel implementations—stale data in the line-fill buffer (via MDS-class attacks). Cloud providers now offer “SMT off” options for security-sensitive workloads, accepting the 20-40% throughput reduction in exchange for eliminating the SMT side channel. This creates a microarchitectural version of the classic performance-security tradeoff: SMT improves throughput at the cost of cross-thread information leakage.

14. Summary

Superscalar execution is the central achievement of high-performance processor design. It transforms a sequential instruction stream into a dynamically scheduled parallel execution, extracting ILP the compiler could not express (because the parallelism is data-dc14dc3 and only discoverable at runtime). The machinery—register renaming, reservation stations, reorder buffer, load-store queue, branch prediction—is intricate, but the principle is elegant: execute instructions as soon as their inputs are available, and reconstruct the illusion of sequential order only at the end.

The three microarchitectures we examined—Haswell, M1 Firestorm, and Zen4—represent different points in the design space. Haswell is balanced and mature, the culmination of Intel’s Core microarchitecture lineage. M1 Firestorm is extraordinarily wide (8-wide decode, ~630-entry ROB), optimized for sustained ILP at the cost of die area and power. Zen4 is frequency-optimized, with a 4-wide decode and a deep pipeline that enables 5.7 GHz clock speeds, trading per-cycle throughput for clock speed.

The superscalar era is not over. While clock frequency scaling has mostly stopped, the transistor budget continues to grow, enabling wider cores, larger instruction windows, and more sophisticated predictors. The frontier is in extracting more ILP from single threads while keeping the power and complexity manageable—a challenge that will occupy microarchitects for the foreseeable future.