CPU Caches and Cache Coherence: The Memory Hierarchy That Makes Modern Computing Fast

2022-07-12 · Leonardo Benicio

A comprehensive exploration of how CPU caches bridge the processor-memory speed gap. Learn about cache architecture, replacement policies, coherence protocols, and how to write cache-friendly code for maximum performance.

Modern CPUs can execute billions of instructions per second, but main memory takes hundreds of cycles to respond. Without caches, processors would spend most of their time waiting for data. The cache hierarchy is one of the most important innovations in computer architecture, and understanding it is essential for writing high-performance software.

1. The Memory Wall Problem

The fundamental challenge that caches solve.

1.1 The Speed Gap

Component          Access Time     Relative Speed
─────────────────────────────────────────────────
CPU Register       ~0.3 ns         1x (baseline)
L1 Cache           ~1 ns           ~3x slower
L2 Cache           ~3-4 ns         ~10x slower
L3 Cache           ~10-20 ns       ~30-60x slower
Main Memory        ~50-100 ns      ~150-300x slower
NVMe SSD           ~20,000 ns      ~60,000x slower
HDD                ~10,000,000 ns  ~30,000,000x slower

1.2 Why the Gap Exists

Memory technology tradeoffs:

SRAM (caches):
├─ Fast: 6 transistors per bit
├─ Expensive: ~100x cost per bit vs DRAM
├─ Power hungry
└─ Low density

DRAM (main memory):
├─ Slow: 1 transistor + 1 capacitor per bit
├─ Cheap: high density
├─ Needs refresh (capacitors leak)
└─ Better power per bit

1.3 The Solution: Caching

Caches exploit two key principles:

Temporal Locality:
"If you accessed it recently, you'll probably access it again"
Example: Loop counter variable

Spatial Locality:
"If you accessed this address, you'll probably access nearby addresses"
Example: Sequential array traversal
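
Both principles are visible in even the simplest loop. In this sketch (illustrative only), sum and i are reused every iteration (temporal locality), while consecutive elements of data share cache lines (spatial locality):

long sum_array(const int *data, int n) {
    long sum = 0;                   // 'sum' and 'i' hit in cache every
    for (int i = 0; i < n; i++) {   // iteration -> temporal locality
        sum += data[i];             // neighboring elements share a
    }                               // 64-byte line -> spatial locality
    return sum;
}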

2. Cache Architecture Fundamentals

How caches are organized internally.

2.1 Cache Lines

Caches don’t store individual bytes—they store fixed-size blocks:

Typical cache line: 64 bytes

Memory address: 0x1234_5678
                └──────┴──┘
                  Tag   │
                        └─ Offset within cache line (6 bits for 64B)

When you read one byte, the entire 64-byte line is loaded:
┌────────────────────────────────────────────────────────────────┐
│ Cache Line (64 bytes)                                          │
│ [byte0][byte1][byte2]...[byte63]                              │
└────────────────────────────────────────────────────────────────┘

2.2 Cache Organization Types

1. Direct-Mapped Cache:
   Each memory address maps to exactly one cache location
   
   Pros: Simple, fast lookup
   Cons: Conflict misses (two addresses compete for same slot)
   
   Memory Address → Index bits → Single Cache Location
   
2. Fully Associative Cache:
   Any memory address can go in any cache location
   
   Pros: No conflict misses
   Cons: Expensive to search all entries
   
   Memory Address → Search All → Any Cache Location
   
3. Set-Associative Cache (most common):
   Address maps to a set; can go in any slot within that set
   
   4-way set associative: 4 slots per set
   8-way set associative: 8 slots per set
   
   Memory Address → Index bits → Set → One of N Slots

2.3 Anatomy of a Cache Entry

┌─────────────────────────────────────────────────────────────────┐
│                        Cache Entry                               │
├──────┬───────┬───────┬─────────────────────────────────────────┤
│Valid │ Dirty │  Tag  │              Data (Cache Line)          │
│ (1b) │ (1b)  │(~30b) │                (64 bytes)               │
├──────┼───────┼───────┼─────────────────────────────────────────┤
│  1   │   0   │ 0x1A3 │ [64 bytes of data from memory]          │
└──────┴───────┴───────┴─────────────────────────────────────────┘

Valid: Is this entry holding real data?
Dirty: Has this data been modified (needs writeback)?
Tag: High bits of address for matching
Data: The actual cached bytes

2.4 Address Breakdown

For a 32KB, 8-way set associative cache with 64-byte lines:

Total entries: 32KB / 64B = 512 entries
Sets: 512 / 8 = 64 sets
Bits needed for set index: log2(64) = 6 bits

Address breakdown (for 48-bit virtual address):
┌────────────────────────┬────────┬────────┐
│         Tag            │  Set   │ Offset │
│      (36 bits)         │(6 bits)│(6 bits)│
└────────────────────────┴────────┴────────┘

Offset: Which byte within the 64-byte line
Set: Which set to look in
Tag: For matching within the set
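
As a sketch of the same breakdown in code (assuming the 32KB, 8-way, 64-byte-line geometry above), the three fields fall out of shifts and masks:

#include <stdint.h>

#define LINE_BITS 6   // 64-byte lines -> offset is the low 6 bits
#define SET_BITS  6   // 64 sets       -> set index is the next 6 bits

// Split an address into offset, set index, and tag for this geometry.
static inline void split_address(uint64_t addr, uint64_t *offset,
                                 uint64_t *set, uint64_t *tag) {
    *offset = addr & ((1ULL << LINE_BITS) - 1);
    *set    = (addr >> LINE_BITS) & ((1ULL << SET_BITS) - 1);
    *tag    = addr >> (LINE_BITS + SET_BITS);
}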

3. The Cache Hierarchy

Modern CPUs have multiple cache levels.

3.1 Typical Modern Configuration

                    ┌─────────────┐
                    │   CPU Core  │
                    └──────┬──────┘
                    ┌──────▼──────┐
                    │  L1 I-Cache │ 32-64 KB, ~4 cycles
                    │  L1 D-Cache │ 32-64 KB, ~4 cycles
                    └──────┬──────┘
                    ┌──────▼──────┐
                    │   L2 Cache  │ 256-512 KB, ~12 cycles
                    └──────┬──────┘
              ┌────────────▼────────────┐
              │        L3 Cache         │ 8-64 MB, ~40 cycles
              │    (Shared across cores)│
              └────────────┬────────────┘
              ┌────────────▼────────────┐
              │      Main Memory        │ ~200 cycles
              └─────────────────────────┘

3.2 Inclusive vs Exclusive Hierarchies

Inclusive (traditional Intel designs):
- L3 contains copy of everything in L1/L2
- Simpler coherence
- Wastes some capacity

L1: [A, B, C]
L2: [A, B, C, D, E]
L3: [A, B, C, D, E, F, G, H]

Exclusive (AMD):
- Each level holds unique data
- Better capacity utilization  
- More complex coherence

L1: [A, B]
L2: [C, D, E]
L3: [F, G, H, I, J]

Non-Inclusive Non-Exclusive (NINE):
- L3 doesn't guarantee inclusion
- Flexible eviction policies
- Modern Intel uses this

3.3 Private vs Shared Caches

┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Core 0  │ │ Core 1  │ │ Core 2  │ │ Core 3  │
├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤
│L1 (Priv)│ │L1 (Priv)│ │L1 (Priv)│ │L1 (Priv)│
├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤
│L2 (Priv)│ │L2 (Priv)│ │L2 (Priv)│ │L2 (Priv)│
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
     │           │           │           │
     └───────────┴─────┬─────┴───────────┘
              ┌────────▼────────┐
              │  L3 (Shared)    │
              └─────────────────┘

Private caches: Low latency, no sharing overhead
Shared caches: Better utilization, requires coherence

4. Cache Replacement Policies

When the cache is full, which line gets evicted?

4.1 Common Policies

LRU (Least Recently Used):
- Evict the line accessed longest ago
- Optimal for many workloads
- Expensive to implement exactly

Pseudo-LRU:
- Approximates LRU with less hardware
- Tree-based tracking
- Good enough in practice

Random:
- Surprisingly effective
- Simple to implement
- Avoids pathological patterns

RRIP (Re-Reference Interval Prediction):
- Intel's modern approach
- Predicts reuse distance
- Handles scan-resistant workloads

4.2 LRU Implementation

True LRU for 4-way associative:
- Need to track order of 4 elements
- 4! = 24 states = 5 bits per set
- Update on every access

Tree-PLRU (Pseudo-LRU):
         [0]              Root bit
        /   \
      [1]   [2]           Level 1 bits
      / \   / \
     W0 W1 W2 W3          Cache ways

On access to a way: flip the bits on its path to point away from it
(e.g., an access to W1 points the root bit at the W2/W3 half and the left bit at W0)
On eviction: follow the bits from the root to find the victim

Only 3 bits per set (vs 5 for true LRU)
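
A minimal C sketch of 4-way Tree-PLRU (a software simulation, not the hardware; the bit convention below, where each bit points toward the less recently used subtree, is one common choice):

#include <stdint.h>

// 3 bits per set: bit0 = root, bit1 = W0/W1 pair, bit2 = W2/W3 pair.
// Convention: bit == 0 means "the next victim is in the left subtree".

int plru_victim(uint8_t bits) {            // follow the bits to the victim
    if (!(bits & 0x1))
        return (bits & 0x2) ? 1 : 0;       // left pair: W0 or W1
    else
        return (bits & 0x4) ? 3 : 2;       // right pair: W2 or W3
}

void plru_touch(uint8_t *bits, int way) {  // point the path away from 'way'
    if (way < 2) {
        *bits |= 0x1;                      // victims now come from W2/W3
        if (way == 0) *bits |=  0x2;       // within the pair, point at W1
        else          *bits &= ~0x2;       // point at W0
    } else {
        *bits &= ~0x1;                     // victims now come from W0/W1
        if (way == 2) *bits |=  0x4;       // point at W3
        else          *bits &= ~0x4;       // point at W2
    }
}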

4.3 Replacement Policy Impact

// Different access patterns favor different policies

// Sequential scan (LRU performs poorly):
for (int i = 0; i < HUGE_ARRAY; i++) {
    sum += array[i];  // Each line used once, evicted before reuse
}

// Working set that fits in cache (LRU works well):
for (int iter = 0; iter < 1000; iter++) {
    for (int i = 0; i < CACHE_SIZE; i++) {
        sum += array[i];  // Lines stay in cache
    }
}

// Random access (all policies similar):
for (int i = 0; i < N; i++) {
    sum += array[random_indices[i]];
}

5. Cache Coherence

When multiple cores have caches, how do we keep them consistent?

5.1 The Coherence Problem

Initial state: memory[X] = 0

Core 0 L1:  [X = 0]     Core 1 L1:  [X = 0]
            
Core 0 writes X = 1:
Core 0 L1:  [X = 1]     Core 1 L1:  [X = 0]  ← STALE!

Without coherence:
- Core 1 reads stale data
- Program behaves incorrectly
- Multithreaded code breaks

5.2 MESI Protocol

The most common cache coherence protocol:

States:
┌───────────────┬────────────────────────────────────────────┐
│ State         │ Meaning                                    │
├───────────────┼────────────────────────────────────────────┤
│ Modified (M)  │ Only copy, dirty (different from memory)   │
│ Exclusive (E) │ Only copy, clean (matches memory)          │
│ Shared (S)    │ Multiple copies may exist, clean           │
│ Invalid (I)   │ Not valid, must fetch from elsewhere       │
└───────────────┴────────────────────────────────────────────┘

5.3 MESI State Transitions

Key transitions for one core's copy of a line:

┌───────────┬───────────────────────────────┬───────────┐
│ From      │ Event                         │ To        │
├───────────┼───────────────────────────────┼───────────┤
│ Invalid   │ Read miss, no other copies    │ Exclusive │
│ Invalid   │ Read miss, other copies exist │ Shared    │
│ Invalid   │ This core writes (write miss) │ Modified  │
│ Exclusive │ This core writes              │ Modified  │
│ Exclusive │ Another core reads            │ Shared    │
│ Exclusive │ Another core writes           │ Invalid   │
│ Shared    │ This core writes (upgrade)    │ Modified  │
│ Shared    │ Another core writes           │ Invalid   │
│ Modified  │ Another core reads            │ Shared    │
│ Modified  │ Another core writes           │ Invalid   │
└───────────┴───────────────────────────────┴───────────┘

A read hit in M, E, or S leaves the state unchanged; only this core's
writes or another core's requests move the line between states.

5.4 Coherence Operations

Scenario: Core 0 has line in M state, Core 1 wants to read

1. Core 1 issues read request on bus
2. Core 0 snoops the bus, sees request for its line
3. Core 0 provides data (cache-to-cache transfer)
4. Core 0 transitions M → S
5. Core 1 receives data in S state
6. Memory may or may not be updated (depends on protocol variant)

This is called a "snoop" or "intervention"

5.5 MOESI and MESIF Extensions

MOESI (AMD):
- Adds Owned (O) state
- Owner provides data, memory not updated
- Reduces memory traffic

MESIF (Intel):
- Adds Forward (F) state  
- One cache designated to respond
- Reduces duplicate responses

Example with Owned state:
Core 0: Modified [X = 5]
Core 1: Read request
Result: Core 0 → Owned, Core 1 → Shared
Memory still has old value (only owner has current)

6. False Sharing

A critical performance pitfall in multithreaded code.

6.1 The Problem

// Looks innocent...
struct Counters {
    long counter0;  // 8 bytes
    long counter1;  // 8 bytes
};

struct Counters counters;

// Thread 0:
void thread0() {
    for (int i = 0; i < 1000000; i++) {
        counters.counter0++;  // Only touches counter0
    }
}

// Thread 1:
void thread1() {
    for (int i = 0; i < 1000000; i++) {
        counters.counter1++;  // Only touches counter1
    }
}

// But both counters are in the SAME cache line!
// Every write invalidates the other core's cache
// Result: 10-100x slower than expected

6.2 Visualizing False Sharing

Cache line (64 bytes):
┌────────────────┬────────────────┬─────────────────────────────┐
│   counter0     │   counter1     │         (padding)           │
│   (8 bytes)    │   (8 bytes)    │        (48 bytes)           │
└────────────────┴────────────────┴─────────────────────────────┘
        ▲                ▲
        │                │
     Thread 0         Thread 1
     writes           writes

Every write:
1. Writer invalidates other core's cache line
2. Other core must fetch updated line
3. Then it writes and invalidates the first core
4. Ping-pong of cache line between cores

6.3 The Solution: Padding

#define CACHE_LINE_SIZE 64

struct Counters {
    alignas(CACHE_LINE_SIZE) long counter0;
    alignas(CACHE_LINE_SIZE) long counter1;
};

// Or with explicit padding:
struct Counters {
    long counter0;
    char padding0[CACHE_LINE_SIZE - sizeof(long)];
    long counter1;
    char padding1[CACHE_LINE_SIZE - sizeof(long)];
};

// C++17 hardware_destructive_interference_size:
#include <new>
struct Counters {
    alignas(std::hardware_destructive_interference_size) long counter0;
    alignas(std::hardware_destructive_interference_size) long counter1;
};

6.4 Measuring False Sharing

# Linux perf can detect cache line contention
perf c2c record ./program
perf c2c report

# Output shows:
# - Cachelines with high contention
# - Which loads/stores conflict
# - HITM (Hit Modified) events
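
To see the effect directly, a small benchmark (a hedged sketch; absolute numbers vary by machine) can time two threads hammering adjacent counters versus padded ones:

// gcc -O2 -pthread false_sharing_demo.c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

static struct { volatile long a, b; } tight;                      // same cache line
static struct { volatile long a; char pad[56]; volatile long b; } padded;

static void *bump(void *p) {
    volatile long *c = p;
    for (long i = 0; i < ITERS; i++) (*c)++;
    return NULL;
}

static double run_pair(volatile long *x, volatile long *y) {
    pthread_t t1, t2;
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    pthread_create(&t1, NULL, bump, (void *)x);
    pthread_create(&t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &e);
    return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void) {
    printf("adjacent counters: %.2f s\n", run_pair(&tight.a, &tight.b));
    printf("padded counters:   %.2f s\n", run_pair(&padded.a, &padded.b));
    return 0;
}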

7. Writing Cache-Friendly Code

Practical techniques for better cache utilization.

7.1 Sequential Access Patterns

// Good: Sequential access (spatial locality)
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += array[i];  // Next element likely in same cache line
}

// Bad: Strided access
for (int i = 0; i < N; i += 16) {
    sum += array[i];  // Only use 1/16 of each cache line
}

// Terrible: Random access
for (int i = 0; i < N; i++) {
    sum += array[rand() % N];  // No locality
}

7.2 Loop Ordering for Multi-Dimensional Arrays

#define ROWS 1000
#define COLS 1000
int matrix[ROWS][COLS];  // Row-major in C

// Good: Row-major traversal (matches memory layout)
for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j];  // Sequential in memory
    }
}

// Bad: Column-major traversal (cache thrashing)
for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j];  // Stride of COLS * sizeof(int)
    }
}

// Performance difference: often 10-50x!

7.3 Structure Layout Optimization

// Bad: Poor cache utilization
struct Bad {
    char a;      // 1 byte + 7 padding
    double b;    // 8 bytes
    char c;      // 1 byte + 7 padding
    double d;    // 8 bytes
};  // Total: 32 bytes, only 18 used

// Good: Grouped by size
struct Good {
    double b;    // 8 bytes
    double d;    // 8 bytes
    char a;      // 1 byte
    char c;      // 1 byte + 6 padding
};  // Total: 24 bytes, 18 used

// Best for hot/cold: Separate structures
struct Hot {
    double b;
    double d;
};

struct Cold {
    char a;
    char c;
};

7.4 Data-Oriented Design

// Object-Oriented (cache-unfriendly for bulk operations):
struct Entity {
    float x, y, z;      // Position
    float vx, vy, vz;   // Velocity
    float health;
    char name[32];
    int id;
    // ... more fields
};
Entity entities[10000];

// Update positions: loads entire struct, uses only 24 bytes
for (int i = 0; i < 10000; i++) {
    entities[i].x += entities[i].vx;
    entities[i].y += entities[i].vy;
    entities[i].z += entities[i].vz;
}

// Data-Oriented (cache-friendly):
struct Positions { float x[10000], y[10000], z[10000]; };
struct Velocities { float vx[10000], vy[10000], vz[10000]; };

Positions pos;
Velocities vel;

// Update positions: sequential access, full cache line utilization
for (int i = 0; i < 10000; i++) {
    pos.x[i] += vel.vx[i];
}
for (int i = 0; i < 10000; i++) {
    pos.y[i] += vel.vy[i];
}
for (int i = 0; i < 10000; i++) {
    pos.z[i] += vel.vz[i];
}

7.5 Blocking (Loop Tiling)

// Matrix multiply without blocking
// Each pass through B column thrashes cache
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

// Matrix multiply with blocking
// Process cache-sized blocks
#define BLOCK 64  // Tune so the active blocks stay cache-resident
#define min(a, b) ((a) < (b) ? (a) : (b))

for (int ii = 0; ii < N; ii += BLOCK) {
    for (int jj = 0; jj < N; jj += BLOCK) {
        for (int kk = 0; kk < N; kk += BLOCK) {
            // Mini matrix multiply on cached blocks
            for (int i = ii; i < min(ii+BLOCK, N); i++) {
                for (int j = jj; j < min(jj+BLOCK, N); j++) {
                    for (int k = kk; k < min(kk+BLOCK, N); k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}

8. Cache Prefetching

Bringing data into cache before it’s needed.

8.1 Hardware Prefetching

Modern CPUs detect patterns and prefetch automatically:

Patterns detected:
- Sequential: array[0], array[1], array[2]...
- Strided: array[0], array[4], array[8]...
- Some complex patterns on modern CPUs

Hardware prefetcher limitations:
- Can't cross page boundaries (4KB)
- Limited number of streams tracked
- Irregular patterns not detected

8.2 Software Prefetching

#include <xmmintrin.h>  // For _mm_prefetch

void process_array(int *data, int n) {
    for (int i = 0; i < n; i++) {
        // Prefetch data for future iterations
        _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);  // L1 cache
        
        // Process current element
        process(data[i]);
    }
}

// Prefetch hints:
// _MM_HINT_T0: Prefetch to L1 (and all levels)
// _MM_HINT_T1: Prefetch to L2 (and L3)
// _MM_HINT_T2: Prefetch to L3 only
// _MM_HINT_NTA: Non-temporal (don't pollute cache)

8.3 When Prefetching Helps

// Prefetching helps: Predictable but non-sequential access
void linked_list_traverse(Node *head) {
    Node *current = head;
    while (current) {
        // Prefetch next node while processing current
        if (current->next) {
            _mm_prefetch((const char *)current->next, _MM_HINT_T0);
        }
        process(current);
        current = current->next;
    }
}

// Prefetching hurts: Already sequential (hardware handles it)
for (int i = 0; i < N; i++) {
    _mm_prefetch(&array[i+16], _MM_HINT_T0);  // Wasteful
    sum += array[i];  // Hardware prefetcher already doing this
}

9. Cache Performance Metrics

Measuring and understanding cache behavior.

9.1 Key Metrics

Hit Rate = Hits / (Hits + Misses)
Miss Rate = 1 - Hit Rate

MPKI = Misses Per Kilo Instructions

Types of misses (the "3 Cs"):
- Compulsory: First access to a line (cold miss)
- Capacity: Working set exceeds cache size
- Conflict: Multiple addresses map to same set
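
These metrics roll up into the classic average memory access time (AMAT) estimate, AMAT = hit time + miss rate × miss penalty, applied level by level. The figures below are made up purely for illustration:

// Illustrative AMAT calculation (all numbers are assumptions)
double amat_cycles(void) {
    double l3 = 40.0 + 0.10 * 200.0;  // L3 hit 40c, 10% of L3 lookups go to DRAM (200c)
    double l2 = 12.0 + 0.20 * l3;     // L2 hit 12c, 20% of L2 lookups go to L3
    return 4.0 + 0.05 * l2;           // L1 hit 4c, 5% miss -> about 5.2 cycles on average
}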

9.2 Using Performance Counters

# Linux perf for cache statistics
perf stat -e L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses ./program

# Example output:
# 1,234,567,890  L1-dcache-loads
#    12,345,678  L1-dcache-load-misses  # 1% miss rate
#    12,345,000  LLC-loads
#       123,456  LLC-load-misses        # ~1% of LLC loads go to main memory

9.3 Cache Miss Visualization

# Cachegrind for detailed cache simulation
valgrind --tool=cachegrind ./program
cg_annotate cachegrind.out.*

# Output shows per-line cache behavior:
# Ir      I1mr    ILmr    Dr       D1mr    DLmr    Dw       D1mw    DLmw
# 1000000    0       0   1000000  250000      0       0        0       0
#   for (int i = 0; i < n; i += 64) sum += array[i];

# Dr: Data reads
# D1mr: L1 data read misses
# DLmr: Last-level cache read misses

9.4 Working Set Analysis

// Determine effective working set size
// by measuring performance vs array size

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void measure_cache_sizes() {
    for (int size = 1024; size <= 64*1024*1024; size *= 2) {
        char *array = malloc(size);
        
        clock_t start = clock();
        
        // Random accesses within array
        for (int i = 0; i < 10000000; i++) {
            array[rand() % size]++;
        }
        
        clock_t end = clock();
        double time = (double)(end - start) / CLOCKS_PER_SEC;
        
        printf("Size: %8d KB, Time: %.3f s\n", size/1024, time);
        free(array);
    }
}

// Output shows jumps at cache boundaries:
// Size:       1 KB, Time: 0.150 s  ← Fits in L1
// Size:       2 KB, Time: 0.151 s
// ...
// Size:      32 KB, Time: 0.155 s  ← L1 boundary
// Size:      64 KB, Time: 0.280 s  ← Falls out of L1
// ...
// Size:     256 KB, Time: 0.290 s  ← L2 boundary
// Size:     512 KB, Time: 0.850 s  ← Falls out of L2

10. Advanced Cache Topics

10.1 Non-Temporal Stores

Bypass cache for write-once data:

#include <emmintrin.h>

void write_without_caching(float *dest, float *src, int n) {
    // Assumes dest and src are 16-byte aligned and n is a multiple of 4
    for (int i = 0; i < n; i += 4) {
        __m128 data = _mm_load_ps(&src[i]);
        _mm_stream_ps(&dest[i], data);  // Bypass cache
    }
    _mm_sfence();  // Ensure stores complete
}

// Use when:
// - Writing large amounts of data
// - Data won't be read again soon
// - Don't want to pollute cache

10.2 Cache Partitioning (Intel CAT)

# Intel Cache Allocation Technology
# Partition L3 cache between applications

# Mount the resctrl filesystem, then check support
mount -t resctrl resctrl /sys/fs/resctrl
cat /sys/fs/resctrl/info/L3/cbm_mask

# Create partition with 4 cache ways
mkdir /sys/fs/resctrl/partition1
echo "0xf" > /sys/fs/resctrl/partition1/schemata

# Assign process to partition
echo $PID > /sys/fs/resctrl/partition1/tasks

Use cases:

  • Isolate noisy neighbors
  • Guarantee cache for latency-sensitive tasks
  • Prevent cache thrashing between workloads

10.3 NUMA and Cache Considerations

NUMA (Non-Uniform Memory Access):
┌────────────────────┐    ┌────────────────────┐
│      Socket 0      │    │      Socket 1      │
│  ┌──────────────┐  │    │  ┌──────────────┐  │
│  │    Cores     │  │    │  │    Cores     │  │
│  └──────┬───────┘  │    │  └──────┬───────┘  │
│         │          │    │         │          │
│  ┌──────▼───────┐  │    │  ┌──────▼───────┐  │
│  │   L3 Cache   │  │    │  │   L3 Cache   │  │
│  └──────┬───────┘  │    │  └──────┬───────┘  │
│         │          │    │         │          │
│  ┌──────▼───────┐  │    │  ┌──────▼───────┐  │
│  │Local Memory  │◄─┼────┼─►│Local Memory  │  │
│  └──────────────┘  │    │  └──────────────┘  │
└────────────────────┘    └────────────────────┘
         │                         │
         └────────┬────────────────┘
            Interconnect

Local memory access: ~100 ns
Remote memory access: ~150-200 ns

Cache coherence across sockets: expensive!
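
On Linux, numactl and numastat are the usual starting points for keeping threads near their memory (illustrative invocations; node numbers depend on the machine):

# Pin a process's CPUs and allocations to the same node
numactl --cpunodebind=0 --membind=0 ./program

# Show the node topology and per-node free memory
numactl --hardware

# Check where a running process's pages actually landed
numastat -p $PID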

10.4 Persistent Memory and Caching

// Intel Optane DC Persistent Memory
// New caching considerations

#include <libpmem.h>

void persist_data(void *dest, void *src, size_t len) {
    memcpy(dest, src, len);
    
    // Ensure data reaches persistent memory
    // (not just CPU cache)
    pmem_persist(dest, len);
}

// Cache line flush instructions:
// CLFLUSH: Flush and invalidate
// CLFLUSHOPT: Optimized flush (can be parallel)
// CLWB: Write back without invalidate (preferred)
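
Roughly what a persist operation boils down to, as a hedged sketch (assumes CLWB support and compiling with -mclwb; real libraries fall back to CLFLUSHOPT or CLFLUSH at runtime):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CL 64

// Write back every cache line covering [addr, addr+len), then order the writes.
static void flush_range_clwb(const void *addr, size_t len) {
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CL - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CL)
        _mm_clwb((void *)p);   // write back; line stays valid in cache
    _mm_sfence();              // order the write-backs before later stores
}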

11. Historical Evolution

11.1 Cache History

1960s: First cache (IBM System/360 Model 85)
       - 16-32 KB
       - Proved caching concept

1980s: On-chip L1 caches
       - Intel 486: 8 KB unified cache
       - Brought cache onto CPU die

1990s: Split I/D caches, L2 on package
       - Pentium: separate I-cache and D-cache
       - Pentium Pro: L2 on same package

2000s: L2 on-die, L3 introduced
       - Pentium 4: on-die L2
       - Core 2: large shared L2; Nehalem: shared on-die L3

2010s: Large shared L3, advanced coherence
       - Sandy Bridge: ring-bus shared L3
       - AMD Zen: L3 as a victim cache of L2

2020s: 3D V-cache, larger L3
       - AMD 3D V-Cache: 96 MB L3
       - Intel Hybrid: different caches for P/E cores

11.2 Future Directions

Emerging trends:
- 3D-stacked cache (more capacity)
- Adaptive replacement policies
- ML-based prefetching
- Near-memory processing
- Processing-in-cache architectures

Challenges:
- Power scaling limits cache growth
- Coherence overhead increases with core count
- Memory wall continues to widen

12. Real-World Cache Optimization Case Studies

12.1 Database Buffer Pools

Database systems carefully manage cache utilization:

PostgreSQL buffer pool strategy:
┌─────────────────────────────────────────────────┐
│              Shared Buffer Pool                  │
├─────────────────────────────────────────────────┤
│  Ring Buffers (for sequential scans)            │
│  ├─ Limited size (256KB default)                │
│  └─ Prevents cache pollution from full scans    │
├─────────────────────────────────────────────────┤
│  Clock Sweep (for random access)                │
│  ├─ Usage count per page                        │
│  └─ Popular pages stay resident                 │
└─────────────────────────────────────────────────┘

The lesson: Application-level caching must consider
CPU cache effects too. Page layout affects L1/L2 hit rates.

12.2 High-Frequency Trading

Financial systems obsess over cache behavior:

// HFT order book: hot data must fit in L1

struct Order {
    uint64_t price;
    uint64_t quantity;
    uint64_t order_id;
    uint32_t side;
    uint32_t flags;
    // Exactly 32 bytes - two orders per cache line
    // (aligning each Order to 64 bytes would pad it to a full line;
    //  cache-line align the containing array instead)
};

// Top of book (most accessed) kept separate
struct alignas(64) TopOfBook {
    uint64_t best_bid;
    uint64_t best_ask;
    uint64_t bid_size;
    uint64_t ask_size;
    // Fits in single cache line
};

// Result: Critical path accesses 1-2 cache lines
// Latency: sub-microsecond order processing

12.3 Game Engine Entity Systems

Modern game engines use data-oriented design:

// Traditional OOP (cache-unfriendly):
class Entity {
    Transform transform;      // 64 bytes
    Physics physics;          // 128 bytes
    Renderer renderer;        // 256 bytes
    AI ai;                    // 512 bytes
    // ... many more components
};
vector<Entity> entities;  // Huge stride

// Update physics: loads entire Entity for each
for (auto& e : entities) {
    e.physics.update();  // Cache miss likely
}

// Data-oriented (cache-friendly):
struct PhysicsComponents {
    vector<Vec3> positions;
    vector<Vec3> velocities;
    vector<float> masses;
};

// Update physics: sequential access
for (int i = 0; i < count; i++) {
    positions[i] += velocities[i] * dt;  // Vectorizable
}

// Result: 10-50x improvement in physics update

12.4 Sparse Matrix Kernels

Numerical kernels and compiler infrastructure apply the same hot/cold split to sparse matrices:

// A typical compressed-sparse-row (CSR-style) representation
// Hot arrays separated from cold metadata

struct SparseRow {
    uint32_t *indices;     // Column indices (hot)
    double *values;        // Values (hot)
    uint32_t size;         // Metadata (cold)
    uint32_t capacity;     // Metadata (cold)
};

// During SpMV (sparse matrix-vector multiply):
// indices and values accessed sequentially
// size/capacity rarely touched

// Further optimization: indices and values interleaved
// for better prefetching in some access patterns
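
A CSR-style sparse matrix-vector multiply shows why those arrays are hot: indices and values stream sequentially while x is gathered. A sketch assuming the SparseRow layout above plus a row count:

#include <stdint.h>

// y = A * x, one SparseRow per row of A
void spmv(const struct SparseRow *rows, int nrows,
          const double *x, double *y) {
    for (int r = 0; r < nrows; r++) {
        double acc = 0.0;
        for (uint32_t k = 0; k < rows[r].size; k++)            // indices/values:
            acc += rows[r].values[k] * x[rows[r].indices[k]];  // sequential (hot)
        y[r] = acc;                                            // x[]: gathered
    }
}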

12.5 Network Packet Processing

High-performance networking optimizes for cache:

// DPDK packet buffer structure
// Designed for cache efficiency

struct rte_mbuf {
    // First cache line (64 bytes) - hot path
    void *buf_addr;
    uint16_t data_off;
    uint16_t data_len;
    uint32_t pkt_len;
    // ... other hot fields
    
    // Second cache line - less frequent access
    struct rte_mbuf *next;
    uint16_t nb_segs;
    // ... metadata
    
    // Remaining lines - rarely accessed
    uint64_t timestamp;
    // ... debugging info
};

// Packet headers also aligned for single-line access:
// Ethernet (14) + IP (20) + TCP (20) = 54 bytes
// Fits in one cache line with minimal padding

13. Debugging Cache Performance Issues

13.1 Identifying Cache Problems

Common symptoms of cache issues:

Symptom: Code runs slower than expected
Possible cache causes:
├─ Working set exceeds cache size
├─ Poor access patterns (strided, random)
├─ False sharing in multithreaded code
├─ Structure layout causing extra misses
└─ Unintended memory allocator behavior

Symptom: Performance varies between runs
Possible cache causes:
├─ ASLR changing alignment
├─ Different initial cache state
└─ Memory allocator placing data differently

Symptom: Adding threads makes it slower
Possible cache causes:
├─ False sharing
├─ Cache line bouncing between cores
└─ L3 contention

13.2 Profiling Tools Comparison

# perf: Quick overview
perf stat -e cache-references,cache-misses ./program

# perf c2c: Find false sharing
perf c2c record ./program
perf c2c report

# cachegrind: Detailed simulation (slow but precise)
valgrind --tool=cachegrind ./program

# Intel VTune: Comprehensive analysis
vtune -collect memory-access ./program

# AMD uProf: AMD-specific insights  
uprof-cli -C memory ./program

13.3 Interpreting perf c2c Output

=================================================
           Shared Data Cache Line Table
=================================================
            Total     Lcl   Snoop     Rmt
  Index   Records    Hitm    Hitm    Hitm       PA
      0      4521    2341      12     198   0x7f...

HITM (Hit Modified): Cache line was modified in another cache
- High HITM = cache line bouncing between cores
- Often indicates false sharing

Drill down:
  0.15%  [kernel]  lock_acquire
  0.12%  program   increment_counter   ← Source of contention

13.4 Cache-Aware Memory Allocators

// Standard malloc may cause cache issues:
// - Adjacent allocations may false share
// - No alignment guarantees beyond 16 bytes

// Solutions:

// 1. aligned_alloc (C11; size must be a multiple of the alignment)
void *ptr = aligned_alloc(64, size);  // Cache line aligned

// 2. posix_memalign (POSIX)
void *ptr;
posix_memalign(&ptr, 64, size);

// 3. Custom allocators with cache awareness
// jemalloc, tcmalloc offer better behavior

// 4. Arena allocators for related objects
typedef struct {
    char *base;      // assumed to be cache-line (64-byte) aligned
    size_t offset;
} Arena;

void *arena_alloc(Arena *a, size_t size) {
    // Objects allocated together stay together -> better spatial locality
    void *ptr = a->base + a->offset;
    a->offset += (size + 63) & ~(size_t)63;  // keep each object line-aligned
    return ptr;
}

13.5 Automated Cache Optimization

// GCC/Clang provide hints:

// Prefetch hint
__builtin_prefetch(address, rw, locality);
// rw: 0=read, 1=write
// locality: 0=no locality to 3=high locality

// Structure packing (saves space, but packed members can be
// misaligned and slower to access; measure before using)
struct __attribute__((packed)) Compact {
    char a;
    int b;
    // No padding between a and b
};

// Cache line alignment
struct __attribute__((aligned(64))) Aligned {
    int data[16];
};

// Hot/cold function splitting
void __attribute__((hot)) critical_path() {
    // Compiler optimizes more aggressively
}

void __attribute__((cold)) error_handler() {
    // Compiler optimizes for size over speed
}

14. The Future of Caching

14.1 3D-Stacked Caches

Traditional:
┌──────────────────────┐
│   CPU Die            │
│  ┌────────────────┐  │
│  │   Cores + L3   │  │
│  └────────────────┘  │
└──────────────────────┘

3D V-Cache (AMD):
┌──────────────────────┐
│   3D V-Cache Die     │  ← Additional 64MB L3
│  (stacked on top)    │
├──────────────────────┤
│   CPU Die            │
│  ┌────────────────┐  │
│  │   Cores + L3   │  │
│  └────────────────┘  │
└──────────────────────┘

Benefits:
- 3x more L3 capacity
- Only a few cycles more latency than the base L3
- Significant gains for gaming, simulation

14.2 Machine Learning for Prefetching

Traditional prefetcher:
- Detects simple patterns
- Limited stride detection
- No semantic understanding

ML-based prefetcher:
- Learns program behavior
- Predicts irregular patterns
- Adapts to workload

Research examples:
- Delta-LSTM: Uses LSTM to predict address deltas
- Voyager: Graph neural network for prefetching
- Pythia: RL-based prefetching decisions

14.3 Processing Near or In Cache

Moving computation closer to data:

Near-Memory Processing:
- Logic near DRAM
- Reduces data movement
- Good for memory-bound workloads

Processing-in-Cache:
- Simple operations in SRAM
- Bit-line computing
- Reduces energy dramatically

Examples:
- AMD's newer architectures explore this
- Research: SCOPE, ComputeDRAM, Ambit

14.4 Cache Challenges in Modern Architectures

As systems become more complex, cache design faces new challenges:

Heterogeneous Computing:
- CPU and GPU share memory
- Different cache architectures must cooperate
- Coherence becomes more expensive

ARM big.LITTLE / Intel Hybrid:
- Different core types with different caches
- P-cores: large private L2, shared L3
- E-cores: L2 shared by a four-core cluster, same L3
- Task migration must consider cache state

Chiplet Architectures:
- AMD Ryzen: Multiple CCDs with separate L3
- Cross-chiplet coherence adds latency
- Locality matters more than ever

┌──────────────┐    ┌──────────────┐
│    CCD 0     │    │    CCD 1     │
│ ┌──────────┐ │    │ ┌──────────┐ │
│ │   L3     │ │    │ │   L3     │ │
│ │  32 MB   │ │    │ │  32 MB   │ │
│ └──────────┘ │    │ └──────────┘ │
│   4 cores    │    │   4 cores    │
└──────┬───────┘    └──────┬───────┘
       │                    │
       └────────┬───────────┘
          Infinity Fabric
          (higher latency)

14.5 Security Implications of Caches

Caches have been exploited in numerous attacks:

Spectre/Meltdown (2018):
- Speculative execution leaves cache traces
- Timing attacks reveal secret data
- Required fundamental CPU changes

Cache Timing Attacks:
- Measure access time to determine cache state
- Can reveal cryptographic keys
- AES T-table attacks

Prime+Probe:
1. Fill cache set with attacker data
2. Victim runs and evicts some lines
3. Attacker measures which lines evicted
4. Infers victim's access patterns
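
The entire attack class rests on one observation: a hit and a miss are easy to tell apart with a cycle counter. A minimal, x86-specific measurement sketch (illustrative only):

#include <immintrin.h>   // _mm_clflush, _mm_mfence
#include <x86intrin.h>   // __rdtscp
#include <stdint.h>

// Time a single read of *p in cycles.
// After _mm_clflush(p) the read is a miss and takes far longer than a hit.
uint64_t time_access(volatile const char *p) {
    unsigned aux;
    _mm_mfence();                     // keep earlier memory traffic out of the window
    uint64_t start = __rdtscp(&aux);
    (void)*p;                         // the access being timed
    uint64_t end = __rdtscp(&aux);
    return end - start;
}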

Mitigations:
- Constant-time cryptography
- Cache partitioning
- Randomized cache indexing (CEASER)

15. Summary

CPU caches are the critical technology bridging the processor-memory speed gap:

Architecture fundamentals:

  • Cache lines (typically 64 bytes)
  • Set-associative organization
  • Multi-level hierarchy (L1 → L2 → L3)

Coherence protocols:

  • MESI/MOESI maintain consistency
  • Snooping detects conflicts
  • False sharing causes performance issues

Writing cache-friendly code:

  • Sequential access patterns
  • Proper loop ordering
  • Structure layout optimization
  • Data-oriented design
  • Loop blocking/tiling

Performance analysis:

  • Use perf counters
  • Measure miss rates
  • Identify working set sizes
  • Profile with cachegrind

Advanced techniques:

  • Software prefetching
  • Non-temporal stores
  • Cache partitioning
  • NUMA awareness

Understanding CPU caches transforms how you write performance-critical code. The difference between cache-friendly and cache-oblivious code can easily be 10-100x in performance. Profile, measure, and optimize your memory access patterns—your caches will thank you.