Virtual Memory and Page Tables: How Modern Systems Manage Memory

A comprehensive exploration of virtual memory, page tables, and address translation. Learn how operating systems provide memory isolation, enable overcommitment, and optimize performance with TLBs and huge pages.
Every process believes it has exclusive access to a vast, contiguous memory space. This elegant illusion—virtual memory—is one of the most important abstractions in computing. Behind it lies a sophisticated system of page tables, TLBs, and hardware-software cooperation that enables memory isolation, efficient sharing, and seemingly infinite memory. Let’s explore how it all works.
1. The Problem: Why Virtual Memory?
Before virtual memory, programs used physical addresses directly. This created serious problems.
1.1 Memory Isolation
Without isolation, any process could read or corrupt another’s memory:
// In a world without virtual memory
int *ptr = (int *)0x12345678; // Physical address
*ptr = 0xDEADBEEF; // Might corrupt the kernel!
Modern systems prevent this entirely. Each process has its own address space:
// With virtual memory
int *ptr = (int *)0x12345678; // Virtual address
*ptr = 0xDEADBEEF; // Only affects THIS process's memory
Two processes can use the same virtual address, but they map to different physical locations.
1.2 Memory Overcommitment
Physical RAM is limited, but virtual address spaces are vast:
Physical RAM: 16 GB
Virtual space: 128 TB (per process on x86-64)
Number of processes: 100+
Total virtual memory: 100 × 128 TB = 12,800 TB
Virtual memory enables this overcommitment through:
- Demand paging: Pages are allocated only when first accessed
- Swapping: Unused pages move to disk
- Sharing: Common pages (libc, kernel) are mapped once
1.3 Memory Fragmentation
Physical memory becomes fragmented over time:
Physical memory after hours of use:
[Used][Free][Used][Free][Used][Used][Free][Used][Free]
Contiguous virtual allocation:
[ Contiguous 16KB virtual buffer ]
  ↓      ↓      ↓      ↓
[4KB][    ][4KB][    ][4KB][    ][4KB]
scattered across physical memory
Virtual memory provides contiguous virtual addresses backed by scattered physical pages.
1.4 Position-Independent Code
Without virtual memory, programs need relocation at load time:
// Without virtual memory: absolute addresses compiled in
call 0x401000 // What if another program is using that address?
// With virtual memory: every process starts at the same virtual address
call 0x401000 // Each process has its own 0x401000
2. Pages and Frames
Virtual memory divides memory into fixed-size units.
2.1 Terminology
- Page: A fixed-size block of virtual memory (typically 4KB)
- Frame: A fixed-size block of physical memory (same size as page)
- Page table: Data structure mapping virtual pages to physical frames
Virtual Address Space Physical Memory
┌───────────────────┐ ┌───────────────────┐
│ Page 0 (0x0000) │ ──────── │ Frame 7 │
├───────────────────┤ ├───────────────────┤
│ Page 1 (0x1000) │ ──────── │ Frame 2 │
├───────────────────┤ ├───────────────────┤
│ Page 2 (0x2000) │ ──╳ │ Frame 3 │
├───────────────────┤ (not ├───────────────────┤
│ Page 3 (0x3000) │ ──────── │ Frame 5 │
└───────────────────┘ mapped) └───────────────────┘
2.2 Address Translation
A virtual address is split into page number and offset:
32-bit virtual address with 4KB pages:
┌─────────────────────────┬──────────────┐
│ Page Number (20 bits)│ Offset (12) │
└─────────────────────────┴──────────────┘
↓
Page Table Lookup
↓
┌─────────────────────────┬──────────────┐
│ Frame Number (20 bits)│ Offset (12) │
└─────────────────────────┴──────────────┘
Physical Address
The offset stays the same—only the page/frame number changes:
def translate_address(virtual_addr, page_table, page_size=4096):
    page_number = virtual_addr // page_size
    offset = virtual_addr % page_size
    if page_number not in page_table:
        raise PageFault(page_number)
    frame_number = page_table[page_number]
    physical_addr = frame_number * page_size + offset
    return physical_addr
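For example, with 4KB pages, virtual address 0x2ABC on a page table that maps page 2 to frame 7 translates like this (a toy one-entry table, purely for illustration):
# Toy page table: virtual page 2 -> physical frame 7
page_table = {2: 7}
physical = translate_address(0x2ABC, page_table)
# page number = 0x2ABC // 4096 = 2, offset = 0xABC
# physical   = 7 * 4096 + 0xABC = 0x7ABC
assert physical == 0x7ABC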
2.3 Page Table Entries
Each page table entry (PTE) contains more than just the frame number:
x86-64 Page Table Entry (64 bits):
┌──────────────────────────────────────────────────────────────────┐
│ 63│62:52│51:M │M-1:12 │11:9│8 │7 │6│5│4 │3 │2│1│0│
│NX │ │RSVD │Frame Num │AVL │G │PAT│D│A│PCD│PWT│U│W│P│
└──────────────────────────────────────────────────────────────────┘
P = Present (is this page in physical memory?)
W = Writable (can we write to this page?)
U = User (can user-mode code access this?)
A = Accessed (has this page been read or written?)
D = Dirty (has this page been written?)
G = Global (don't flush from TLB on context switch)
NX = No Execute (prevent code execution from this page)
These bits enable:
- Copy-on-write: Mark pages read-only, copy on write fault
- Demand paging: Mark pages not-present, allocate on fault
- Memory protection: Prevent user access to kernel pages
- DEP/NX: Prevent execution of data pages
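As a quick sketch, the flag layout above can be decoded in a few lines of Python (bit positions follow the diagram; the frame mask assumes the common 52-bit physical address limit):
PRESENT, WRITABLE, USER = 1 << 0, 1 << 1, 1 << 2
ACCESSED, DIRTY, GLOBAL = 1 << 5, 1 << 6, 1 << 8
NX = 1 << 63
FRAME_MASK = 0x000FFFFFFFFFF000  # bits 51:12 hold the frame address

def decode_pte(pte):
    """Split a raw x86-64 PTE into its frame address and common flags."""
    return {
        "frame_addr": pte & FRAME_MASK,
        "present": bool(pte & PRESENT),
        "writable": bool(pte & WRITABLE),
        "user": bool(pte & USER),
        "accessed": bool(pte & ACCESSED),
        "dirty": bool(pte & DIRTY),
        "global": bool(pte & GLOBAL),
        "no_execute": bool(pte & NX),
    }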
3. Multi-Level Page Tables
A flat page table for a 64-bit address space would be enormous.
3.1 The Size Problem
With 4KB pages and 48-bit virtual addresses (256TB):
Number of pages = 2^48 / 2^12 = 2^36 ≈ 69 billion pages
PTE size = 8 bytes
Page table size = 2^36 × 8 bytes = 512 GB per process!
Even for a process using only 1MB of memory, we’d need a 512GB page table.
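The same arithmetic in a few lines of Python, if you want to experiment with other page sizes (a quick sketch):
va_bits, page_bits, pte_bytes = 48, 12, 8
num_pages = 1 << (va_bits - page_bits)    # 2**36 pages
flat_table = num_pages * pte_bytes        # 2**39 bytes
print(flat_table // 2**30, "GB")          # 512 GB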
3.2 Hierarchical Page Tables
The solution: multi-level page tables that are sparse:
x86-64 Four-Level Page Table:
Virtual Address (48 bits used):
┌─────────┬─────────┬─────────┬─────────┬────────────┐
│PML4 (9) │PDPT (9) │ PD (9) │ PT (9) │Offset (12) │
└────┬────┴────┬────┴────┬────┴────┬────┴────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│  PML4   │→ │  PDPT   │→ │   PD    │→ │   PT    │→ Frame
│(512 ent)│  │(512 ent)│  │(512 ent)│  │(512 ent)│
└─────────┘  └─────────┘  └─────────┘  └─────────┘
Each level has 512 entries (9 bits), and tables are only allocated as needed:
def translate_4level(virtual_addr, cr3):
    """x86-64 four-level page table walk."""
    # Extract indices from virtual address
    pml4_idx = (virtual_addr >> 39) & 0x1FF
    pdpt_idx = (virtual_addr >> 30) & 0x1FF
    pd_idx = (virtual_addr >> 21) & 0x1FF
    pt_idx = (virtual_addr >> 12) & 0x1FF
    offset = virtual_addr & 0xFFF
    # Walk the page table hierarchy
    pml4 = read_physical(cr3)
    pml4e = pml4[pml4_idx]
    if not pml4e.present:
        raise PageFault(virtual_addr)
    pdpt = read_physical(pml4e.frame_addr)
    pdpte = pdpt[pdpt_idx]
    if not pdpte.present:
        raise PageFault(virtual_addr)
    pd = read_physical(pdpte.frame_addr)
    pde = pd[pd_idx]
    if not pde.present:
        raise PageFault(virtual_addr)
    pt = read_physical(pde.frame_addr)
    pte = pt[pt_idx]
    if not pte.present:
        raise PageFault(virtual_addr)
    return pte.frame_addr + offset
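To see the bit-slicing concretely, here is the index extraction for one sample address (the comments are simply the bit-slices of this particular value):
addr = 0x00007FFFDEADBEEF
print(hex((addr >> 39) & 0x1FF))  # 0xff  -> PML4 index 255
print(hex((addr >> 30) & 0x1FF))  # 0x1ff -> PDPT index 511
print(hex((addr >> 21) & 0x1FF))  # 0xf5  -> PD index 245
print(hex((addr >> 12) & 0x1FF))  # 0xdb  -> PT index 219
print(hex(addr & 0xFFF))          # 0xeef -> page offset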
3.3 Memory Savings
For a process using only a few megabytes:
Without hierarchy: 512 GB (fixed)
With hierarchy:
- 1 PML4 table: 4 KB
- 1 PDPT table: 4 KB
- 1 PD table: 4 KB
- 1 PT table: 4 KB
Total: 16 KB (for up to 2 MB of mappings)
Sparse address spaces only allocate the page table entries they need.
3.4 Five-Level Page Tables
Modern CPUs support 5-level paging for 57-bit virtual addresses (128 PB):
LA57 (5-level paging):
┌─────────┬─────────┬─────────┬─────────┬─────────┬────────────┐
│PML5 (9) │PML4 (9) │PDPT (9) │ PD (9) │ PT (9) │Offset (12) │
└─────────┴─────────┴─────────┴─────────┴─────────┴────────────┘
This extra level matters mainly for servers with very large physical memory, persistent memory, and huge memory-mapped datasets.
4. The Translation Lookaside Buffer (TLB)
Walking the page table for every memory access would be devastatingly slow.
4.1 The Problem
Each memory access requires multiple page table lookups:
Without TLB, reading one byte:
1. Read PML4 entry (memory access)
2. Read PDPT entry (memory access)
3. Read PD entry (memory access)
4. Read PT entry (memory access)
5. Read actual data (memory access)
Total: 5 memory accesses for 1 logical access!
4.2 TLB as a Cache
The TLB caches recent virtual-to-physical translations:
TLB (Translation Lookaside Buffer):
┌─────────────────┬────────────────┬───────────┐
│ Virtual Page │ Physical Frame │ Flags │
├─────────────────┼────────────────┼───────────┤
│ 0x7fff_8000 │ 0x1234_5000 │ RWX, User │
│ 0x0040_1000 │ 0x0042_3000 │ R-X, User │
│ 0xffff_8000 │ 0x0000_1000 │ RW-, Kern │
│ ... │ ... │ ... │
└─────────────────┴────────────────┴───────────┘
With a TLB hit, translation takes 1 cycle instead of 4 memory accesses.
4.3 TLB Organization
Modern CPUs have multiple TLB levels:
Intel Core i7 TLB Organization:
L1 ITLB (instruction): 128 entries, 4-way set associative
L1 DTLB (data): 64 entries, 4-way set associative
L2 STLB (unified): 1536 entries, 12-way set associative
TLB miss rates are typically < 1% for most workloads.
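A useful derived number is TLB reach: how much memory the TLB can cover without a single miss. A quick sketch using the L2 STLB size above (this is exactly why huge pages help, as Section 6 shows):
def tlb_reach(entries, page_size):
    """Memory covered by the TLB before any page walk is needed."""
    return entries * page_size

print(tlb_reach(1536, 4096) // 2**20, "MB")       # 6 MB with 4 KB pages
print(tlb_reach(1536, 2 * 2**20) // 2**30, "GB")  # 3 GB with 2 MB pages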
4.4 TLB Management
The OS must manage TLB consistency:
// When page tables change, flush affected TLB entries
// Flush entire TLB (expensive)
void flush_tlb_all(void) {
    unsigned long cr3 = read_cr3();
    write_cr3(cr3); // Reloading CR3 flushes all non-global TLB entries
}
// Flush single page (x86 INVLPG instruction)
void flush_tlb_page(unsigned long addr) {
    asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}
// Flush range of pages
void flush_tlb_range(unsigned long start, unsigned long end) {
    for (unsigned long addr = start; addr < end; addr += PAGE_SIZE) {
        flush_tlb_page(addr);
    }
}
TLB shootdowns are needed for multi-core systems:
// TLB shootdown: flush TLB on all CPUs
void flush_tlb_all_cpus(unsigned long addr) {
    // Send IPI (Inter-Processor Interrupt) to all CPUs
    for_each_online_cpu(cpu) {
        if (cpu != current_cpu) {
            send_ipi(cpu, TLB_FLUSH_IPI);
        }
    }
    flush_tlb_page(addr);
    wait_for_ack_from_all_cpus();
}
TLB shootdowns are expensive and can become a bottleneck for applications with heavy page table modifications.
5. Page Faults
When translation fails, the CPU triggers a page fault.
5.1 Types of Page Faults
void handle_page_fault(unsigned long address, unsigned long error_code) {
    struct vm_area_struct *vma = find_vma(current->mm, address);
    if (!vma || address < vma->vm_start) {
        // Segmentation fault: address not mapped
        send_signal(SIGSEGV, current);
        return;
    }
    if ((error_code & PF_WRITE) && !(vma->vm_flags & VM_WRITE)) {
        if (vma->vm_flags & VM_MAYWRITE) {
            // Copy-on-write fault
            handle_cow_fault(vma, address);
        } else {
            // Permission violation
            send_signal(SIGSEGV, current);
        }
        return;
    }
    if (!(error_code & PF_PRESENT)) {
        // Page not in memory
        if (vma->vm_file) {
            // Memory-mapped file: read from disk
            handle_file_fault(vma, address);
        } else {
            // Anonymous page: allocate zeroed page
            handle_anon_fault(vma, address);
        }
    }
}
5.2 Demand Paging
Pages are allocated only when first accessed:
char *ptr = mmap(NULL, 1024*1024*1024, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// No physical memory allocated yet!
ptr[0] = 'A'; // Page fault → allocate page 0
ptr[4096] = 'B'; // Page fault → allocate page 1
// Only 8KB of physical memory used for a 1GB mapping
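You can watch demand paging happen from Python by counting minor faults around the first touches (a rough sketch; exact counts vary slightly because the interpreter itself takes faults):
import mmap, resource

buf = mmap.mmap(-1, 1 << 30)  # 1 GB anonymous mapping, nothing resident yet
before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
buf[0] = 65       # first touch of page 0 -> minor fault, frame allocated
buf[4096] = 66    # first touch of page 1 -> another minor fault
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
print(after - before)  # ~2: one fault per first-touched page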
This enables memory overcommitment:
// Linux can promise more memory than physically exists
// vm.overcommit_memory controls this behavior
// Check how much memory is "committed" vs available
// /proc/meminfo: Committed_AS vs CommitLimit
5.3 Copy-on-Write (COW)
When a process forks, pages are shared until written:
pid_t pid = fork();
// Parent and child share all pages (marked read-only)
if (pid == 0) {
    // Child writes to a shared page
    global_variable = 42;
    // Page fault → kernel copies the page
    // Child gets its own copy, parent's unchanged
}
This makes fork() nearly instantaneous, even for large processes.
5.4 Lazy Allocation
Stack and heap grow lazily:
void recursive_function(int depth) {
    char buffer[4096]; // Might trigger page fault for stack growth
    if (depth > 0) {
        recursive_function(depth - 1);
    }
}
// Stack limit is typically 8MB, but physical pages
// are allocated only as the stack grows
6. Huge Pages
Standard 4KB pages create overhead for large memory workloads.
6.1 The Problem with Small Pages
For a 128GB database buffer pool:
Number of pages: 128 GB / 4 KB = 32 million pages
TLB entries: ~1536 (L2 STLB)
TLB coverage: 1536 × 4 KB = 6 MB
Only 0.005% of the buffer pool fits in TLB!
Result: Constant TLB misses, performance degradation
6.2 Huge Page Sizes
x86-64 supports larger pages:
Page Size Coverage per TLB entry Use Case
─────────────────────────────────────────────────
4 KB 4 KB General purpose
2 MB 2 MB (512× more) Large allocations
1 GB 1 GB (262144× more) Huge allocations
With 2MB huge pages:
Number of pages: 128 GB / 2 MB = 65,536 pages
TLB coverage: 1536 × 2 MB = 3 GB
Now ~2.3% of the buffer pool fits in TLB (512× more coverage per entry)
6.3 Using Huge Pages
Linux provides several mechanisms:
// Method 1: mmap with MAP_HUGETLB
void *ptr = mmap(NULL, size, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);
// Method 2: Transparent Huge Pages (THP)
// Kernel automatically uses huge pages when possible
// Enable: echo always > /sys/kernel/mm/transparent_hugepage/enabled
// Method 3: hugetlbfs
// mount -t hugetlbfs none /mnt/huge
int fd = open("/mnt/huge/myfile", O_CREAT|O_RDWR, 0755);
void *ptr = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
6.4 Huge Page Challenges
Huge pages aren’t free:
Challenge 1: Fragmentation
- Need contiguous 2MB regions
- System may not have them after running a while
- Solution: Reserve huge pages at boot, or use compaction
Challenge 2: Memory waste
- Internal fragmentation: 1 byte allocation wastes 2MB - 1 byte
- Only beneficial for large allocations
Challenge 3: THP latency spikes
- Kernel compaction can cause long pauses
- khugepaged background thread uses CPU
Challenge 4: Memory accounting
- Hard to track actual usage with copy-on-write
- Can cause OOM in unexpected ways
Database best practices:
# Disable THP for databases (causes latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Use explicit huge pages instead
echo 8192 > /proc/sys/vm/nr_hugepages # Reserve 16GB in 2MB pages
7. Memory-Mapped Files
Virtual memory enables efficient file I/O through memory mapping.
7.1 How It Works
int fd = open("database.db", O_RDWR);
void *map = mmap(NULL, file_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
// Reading from the file
int value = ((int *)map)[1000];
// If page not in memory: page fault → kernel reads from disk
// Writing to the file
((int *)map)[1000] = 42;
// Page marked dirty → kernel writes back later (or on msync/munmap)
7.2 Advantages
Memory-mapped files offer several benefits:
// 1. Zero-copy: page cache pages are mapped straight into the process
// No copy from the kernel's page cache into user-space buffers
// 2. Automatic caching: kernel page cache handles everything
// Recently accessed pages stay in memory
// 3. Lazy loading: only accessed pages are read from disk
// 4. Shared mappings: multiple processes can share the same pages
int fd = open("shared.db", O_RDWR);
void *map = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
// Changes visible to all processes mapping the same file
7.3 Gotchas
Memory-mapped files have pitfalls:
// Problem 1: I/O errors become signals
void handle_sigbus(int sig) {
    // Disk error during page-in causes SIGBUS, not read() error
    // Hard to handle gracefully
}
signal(SIGBUS, handle_sigbus);
// Problem 2: No fine-grained control over I/O timing
// Can't prioritize which pages to read first
// Problem 3: Page-sized I/O granularity
// Reading 1 byte still loads entire 4KB page
// Problem 4: Difficult to manage memory pressure
// Can't easily tell the kernel which pages to evict
7.4 madvise Hints
Tell the kernel about access patterns:
// Sequential access: prefetch ahead
madvise(map, size, MADV_SEQUENTIAL);
// Random access: don't prefetch
madvise(map, size, MADV_RANDOM);
// Will need this soon: prefetch pages
madvise((char *)map + offset, length, MADV_WILLNEED);
// Won't need anymore: allow kernel to free pages
madvise((char *)map + offset, length, MADV_DONTNEED);
// Using huge pages would help
madvise(map, size, MADV_HUGEPAGE);
8. Kernel vs. User Address Space
The virtual address space is split between kernel and user.
8.1 Address Space Layout
x86-64 Linux with 4-level paging (48-bit addresses):
User space: 0x0000_0000_0000_0000 - 0x0000_7fff_ffff_ffff (128 TB)
↑
Lower half of address space
[Canonical hole: addresses whose bits 47-63 are not all 0s or all 1s]
Kernel space: 0xffff_8000_0000_0000 - 0xffff_ffff_ffff_ffff (128 TB)
↑
Upper half of address space (all 1s in bits 48-63)
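The canonical-address rule is just sign extension: bits 48-63 must replicate bit 47. A small sketch that checks it:
def is_canonical(addr, va_bits=48):
    """Bits va_bits-1 through 63 must be all zeros or all ones."""
    top = addr >> (va_bits - 1)  # bit 47 and everything above it
    return top == 0 or top == (1 << (65 - va_bits)) - 1

assert is_canonical(0x00007FFFFFFFFFFF)      # top of user space
assert is_canonical(0xFFFF800000000000)      # bottom of kernel space
assert not is_canonical(0x0000800000000000)  # inside the hole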
8.2 Why Share the Address Space?
Kernel mappings in every process’s address space:
// System call without shared kernel mapping:
// 1. Save all user registers
// 2. Load kernel page table (CR3 write = TLB flush!)
// 3. Execute system call
// 4. Load user page table (another TLB flush!)
// 5. Restore registers
// With shared mapping:
// 1. Switch to ring 0
// 2. Execute system call (kernel pages already mapped!)
// 3. Return to ring 3
// No page table switch, no TLB flush!
8.3 Kernel Page Table Isolation (KPTI)
Meltdown vulnerability forced a change:
Before Meltdown/KPTI:
User process sees: [User pages] [Kernel pages (inaccessible but mapped)]
After KPTI:
User mode: [User pages] [Minimal kernel trampoline]
Kernel mode: [User pages] [Full kernel mapping]
Page table switch on every syscall/interrupt - performance cost!
KPTI overhead:
Syscall-heavy workloads: 5-30% slowdown
Memory-mapped I/O: 2-5% slowdown
Compute-heavy: < 1% slowdown
9. NUMA and Virtual Memory
Non-Uniform Memory Access adds complexity to memory management.
9.1 NUMA Architecture
┌─────────────────────────────────────────────────────────────┐
│ NUMA System │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Node 0 │ │ Node 1 │ │
│ │ ┌───────────────┐ │ │ ┌───────────────┐ │ │
│ │ │ CPU 0-7 │ │ │ │ CPU 8-15 │ │ │
│ │ └───────┬───────┘ │ │ └───────┬───────┘ │ │
│ │ │ │ │ │ │ │
│ │ ┌───────▼───────┐ │ │ ┌───────▼───────┐ │ │
│ │ │ Local Memory │◄─┼─────┼─►│ Local Memory │ │ │
│ │ │ (fast) │ │ │ │ (fast) │ │ │
│ │ └───────────────┘ │ │ └───────────────┘ │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ ▲ Interconnect (slower than local) ▲ │
└─────────────────────────────────────────────────────────────┘
Memory access latency varies by location:
Local memory access: ~80 ns
Remote memory access: ~140 ns (1.75× slower)
9.2 NUMA-Aware Memory Allocation
#include <numa.h>
// Allocate on specific node
void *ptr = numa_alloc_onnode(size, node);
// Allocate interleaved across all nodes
void *ptr = numa_alloc_interleaved(size);
// Bind memory to nodes
numa_tonode_memory(ptr, size, node);
// Set NUMA policy for allocations
set_mempolicy(MPOL_BIND, &nodemask, maxnode);
9.3 First-Touch Policy
Linux defaults to first-touch allocation:
char *ptr = mmap(NULL, size, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// No physical memory allocated yet
// Thread on node 0 touches the memory first
ptr[0] = 'A';
// Page allocated on node 0
// Thread on node 1 accesses the same page
char c = ptr[0];
// Remote access! 1.75× slower
This can cause performance problems:
// Anti-pattern: initialize memory on one thread, use on another
void init_thread() {
for (int i = 0; i < size; i++) {
buffer[i] = 0; // All pages allocated on init thread's node
}
}
void worker_threads() {
// If workers are on different nodes, all accesses are remote!
}
// Better: parallel first-touch initialization
#pragma omp parallel for
for (int i = 0; i < size; i++) {
buffer[i] = 0; // Each thread touches pages it will use
}
9.4 NUMA Balancing
Linux can automatically migrate pages:
# Enable automatic NUMA balancing
echo 1 > /proc/sys/kernel/numa_balancing
# The kernel tracks page access patterns and migrates pages
# to be closer to the threads accessing them
However, migration has overhead:
Page migration cost: Copy 4KB page + TLB flush + update page tables
Worth it only if: Future access savings > migration cost
Kernel uses heuristics: access frequency, migration history
10. Virtual Memory Security
Virtual memory is the foundation of process isolation and security.
10.1 Address Space Layout Randomization (ASLR)
Randomize memory layout to hinder exploits:
# Check ASLR setting
cat /proc/sys/kernel/randomize_va_space
# 0 = disabled, 1 = partial, 2 = full
# Run a program twice, observe different addresses
$ cat /proc/self/maps | head -1
55a3b2c00000-55a3b2c01000 r--p ...
$ cat /proc/self/maps | head -1
559c1a400000-559c1a401000 r--p ... # Different!
ASLR randomizes:
- Stack location
- Heap location
- Library load addresses
- Executable base (PIE)
- mmap regions
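You can observe ASLR without writing any C: each interpreter process gets freshly randomized mmap regions, so an allocation's address changes from run to run (a quick sketch; the specific addresses are meaningless, only their variation matters):
import ctypes
x = ctypes.c_int(0)
print(hex(ctypes.addressof(x)))  # run twice: differs when ASLR is enabled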
10.2 Stack Canaries and Guard Pages
Protect against stack buffer overflows:
void vulnerable_function(char *input) {
    char buffer[64];
    unsigned long canary = __stack_chk_guard; // Random value
    strcpy(buffer, input); // Potential overflow
    if (canary != __stack_chk_guard) {
        __stack_chk_fail(); // Overflow detected!
    }
}
Guard pages prevent stack overflow into other memory:
Stack Layout:
┌──────────────────┐
│ Stack growth ↓ │
├──────────────────┤
│ Guard page │ ← Not mapped, access triggers SIGSEGV
├──────────────────┤
│ Next region │
└──────────────────┘
10.3 W^X (Write XOR Execute)
Pages should never be both writable and executable:
// Allocate non-executable heap (default)
void *heap = malloc(1024); // W, no X
// Allocate non-writable code
void *code = mmap(NULL, 4096, PROT_READ|PROT_EXEC,
                  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // R+X, no W
// JIT compilation: write then execute
void *jit = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // W, no X
// ... generate code ...
mprotect(jit, 4096, PROT_READ|PROT_EXEC); // Now R+X, no W
10.4 Memory Tagging (MTE)
ARM Memory Tagging Extension catches memory safety bugs:
// Each pointer has a 4-bit tag (in unused high bits)
// Each 16-byte granule of memory has a 4-bit tag
// Access checks: pointer tag must match memory tag
char *ptr = tagged_malloc(16); // ptr has tag 0x5, memory has tag 0x5
ptr[0] = 'A'; // Tags match, OK
free(ptr); // Memory tag changed to 0x7
ptr[0] = 'B'; // Tag mismatch! Hardware exception
// Use-after-free detected!
11. Performance Debugging
Understanding virtual memory behavior is crucial for performance.
11.1 Measuring Page Faults
# Count page faults for a process
$ /usr/bin/time -v ./myprogram
Minor (reclaiming a frame) page faults: 12345
Major (requiring I/O) page faults: 100
# Live monitoring
$ perf stat -e page-faults,minor-faults,major-faults ./myprogram
// Programmatic monitoring
#include <sys/resource.h>
struct rusage usage;
getrusage(RUSAGE_SELF, &usage);
printf("Minor faults: %ld\n", usage.ru_minflt);
printf("Major faults: %ld\n", usage.ru_majflt);
11.2 TLB Profiling
# Measure TLB misses
$ perf stat -e dTLB-load-misses,dTLB-store-misses,iTLB-load-misses ./myprogram
# TLB miss ratio
$ perf stat -e dTLB-loads,dTLB-load-misses ./myprogram
# 1,000,000 dTLB-loads
# 10,000 dTLB-load-misses # 1% miss rate
High TLB miss rates suggest:
- Working set too large for TLB
- Consider huge pages
- Improve memory locality
11.3 Memory Access Patterns
# Visualize memory access patterns
$ perf record -e mem_load_retired.l3_miss ./myprogram
$ perf report
# Memory bandwidth and latency
$ perf stat -e cache-references,cache-misses ./myprogram
11.4 Common Performance Issues
Issue: High major page fault rate
Cause: Memory pressure, frequent swapping
Fix: Add RAM, reduce memory usage, tune swappiness
Issue: High TLB miss rate
Cause: Large working set, scattered access patterns
Fix: Use huge pages, improve locality
Issue: NUMA imbalance
Cause: Memory on wrong node, poor first-touch
Fix: numactl binding, parallel initialization
Issue: High syscall overhead (with KPTI)
Cause: Frequent user/kernel transitions
Fix: Batch syscalls, use io_uring
12. Virtual Memory Across Operating Systems
Different operating systems implement virtual memory with their own approaches and optimizations.
12.1 Linux
Linux’s virtual memory system is highly configurable:
# Key tunables in /proc/sys/vm/
# Swappiness: tendency to swap vs. drop cache (0-100)
cat /proc/sys/vm/swappiness # Default 60
# Overcommit behavior
cat /proc/sys/vm/overcommit_memory
# 0 = heuristic, 1 = always overcommit, 2 = never overcommit
# Dirty page writeback thresholds
cat /proc/sys/vm/dirty_ratio # % of RAM for dirty pages
cat /proc/sys/vm/dirty_background_ratio # Start background writeback
# Huge page settings
cat /proc/sys/vm/nr_hugepages
cat /sys/kernel/mm/transparent_hugepage/enabled
Linux memory zones:
Zone Purpose
─────────────────────────────────
ZONE_DMA Legacy 16MB for ISA DMA
ZONE_DMA32 Memory addressable by 32-bit DMA
ZONE_NORMAL Regular memory
ZONE_MOVABLE For memory hotplug/migration
12.2 Windows
Windows uses a different terminology and approach:
Windows Memory Concepts:
─────────────────────────
Working Set: Pages currently in physical memory
Commit Charge: Total virtual memory committed
Paged Pool: Kernel memory that can be paged out
Nonpaged Pool: Kernel memory that must stay resident
Windows memory management differs from Linux in several ways:
// Windows API for memory management
LPVOID VirtualAlloc(
    LPVOID lpAddress,
    SIZE_T dwSize,
    DWORD  flAllocationType, // MEM_RESERVE, MEM_COMMIT
    DWORD  flProtect
);
// Reserve address space without committing physical memory
void *reserved = VirtualAlloc(NULL, (SIZE_T)1 << 30 /* 1 GB */,
                              MEM_RESERVE, PAGE_NOACCESS);
// Later, commit portions as needed
void *committed = VirtualAlloc(reserved, 4096, MEM_COMMIT, PAGE_READWRITE);
Windows has distinct reserve and commit operations, unlike Linux’s implicit overcommit.
12.3 FreeBSD
FreeBSD’s virtual memory system has unique characteristics:
FreeBSD VM Features:
─────────────────────────────────────────
Superpages: Automatic huge page promotion
VM objects: Copy-on-write at object level
Swap clustering: Group pages for efficient swap I/O
NUMA domains: First-class NUMA support
12.4 macOS
macOS builds on Mach microkernel VM concepts:
// Mach VM API
vm_allocate(mach_task_self(), &address, size, VM_FLAGS_ANYWHERE);
vm_deallocate(mach_task_self(), address, size);
// Memory pressure notifications
dispatch_source_t source = dispatch_source_create(
    DISPATCH_SOURCE_TYPE_MEMORYPRESSURE,
    0,
    DISPATCH_MEMORYPRESSURE_WARN | DISPATCH_MEMORYPRESSURE_CRITICAL,
    queue
);
macOS aggressive compression:
macOS Memory Compression:
─────────────────────────
Instead of swapping to disk immediately, macOS compresses inactive pages
in memory. Decompression is much faster than disk I/O.
Typical compression ratio: 2-3x
Result: More effective RAM, less swap I/O
13. Advanced Topics
13.1 Persistent Memory
Intel Optane and similar technologies blur the line between memory and storage:
#include <libpmem.h>
// Map persistent memory
void *pmem = pmem_map_file("/pmem/data", size,
                           PMEM_FILE_CREATE, 0666, &mapped_size, &is_pmem);
// Writes are durable after flush
memcpy(pmem + offset, data, len);
pmem_persist(pmem + offset, len); // Data survives power loss!
This changes virtual memory assumptions:
Traditional:
Virtual Address → Physical RAM → (volatile, lost on power loss)
With Persistent Memory:
Virtual Address → Physical PMEM → (durable, survives reboot)
13.2 Memory Ballooning
Hypervisors use ballooning to reclaim memory from VMs:
Balloon Driver Operation:
1. Hypervisor wants to reclaim 1GB from VM
2. Sends signal to balloon driver in guest
3. Balloon driver allocates 1GB of guest memory
4. Guest OS pages out or compresses that memory
5. Balloon driver tells hypervisor which physical pages to reclaim
6. Hypervisor unmaps those pages, can give to other VMs
// Linux virtio_balloon driver (simplified sketch)
static void balloon_inflate(struct virtio_balloon *vb) {
    struct page *page = alloc_page(GFP_KERNEL);
    list_add(&page->lru, &vb->pages);
    tell_host_about_page(page); // Hypervisor can reclaim this
}
13.3 Memory Hotplug
Adding or removing memory while the system runs:
# Check available memory blocks
ls /sys/devices/system/memory/
# Online a new memory block
echo online > /sys/devices/system/memory/memory32/state
# Offline memory (requires pages to be migrated first)
echo offline > /sys/devices/system/memory/memory32/state
Challenges:
Offlining Memory:
1. Cannot offline pages with kernel data structures
2. Must migrate movable pages to other memory
3. Huge pages complicate migration
4. Some memory is inherently unmovable (kernel text)
13.4 Kernel Same-page Merging (KSM)
Deduplicate identical pages across processes:
// Mark memory region as mergeable
madvise(addr, size, MADV_MERGEABLE);
KSM operation:
1. ksmd kernel thread scans mergeable pages
2. Computes hash of each page
3. Compares pages with same hash
4. Identical pages merged (copy-on-write)
5. Significant memory savings for VMs with same OS
Trade-offs:
Benefits:
- Memory savings (especially for VMs running same OS)
- Can overcommit more aggressively
Costs:
- CPU overhead for scanning and hashing
- Memory access latency (copy-on-write faults)
- Security concern: timing attacks can detect merging
14. Historical Context and Evolution
14.1 Pre-Virtual Memory Era
Early systems used physical addresses directly:
1950s-1960s:
- Programs loaded at fixed addresses
- Only one program in memory at a time
- Manual memory management (overlays)
14.2 Segmentation
Before paging, segmentation provided some isolation:
Intel 8086 Segmentation:
Physical Address = (Segment × 16) + Offset
Segments provided:
- Code segment (CS)
- Data segment (DS)
- Stack segment (SS)
- Extra segment (ES)
Segments were variable-sized, leading to fragmentation.
14.3 Paging Revolution
Atlas Computer (1962) introduced paging:
Key innovations:
- Fixed-size pages (512 words)
- Demand paging
- Page replacement algorithms
- Virtual addresses transparent to programs
14.4 Modern Evolution
1990s: PAE (36-bit physical addresses on 32-bit x86)
2000s: x86-64 (48-bit virtual, later 57-bit)
2010s: Virtualization extensions (EPT/NPT)
2018: Hardware mitigations (KPTI, Retpoline)
2020s: Memory tagging, CXL memory expansion
15. Practical Implementation Example
Let’s implement a simple page table simulator to solidify understanding:
from dataclasses import dataclass
from typing import Dict
from enum import IntFlag

class PageFlags(IntFlag):
    PRESENT = 1 << 0
    WRITABLE = 1 << 1
    USER = 1 << 2
    ACCESSED = 1 << 3
    DIRTY = 1 << 4
    NO_EXECUTE = 1 << 63

@dataclass
class PageTableEntry:
    frame_number: int
    flags: PageFlags

    def is_present(self) -> bool:
        return bool(self.flags & PageFlags.PRESENT)

    def is_writable(self) -> bool:
        return bool(self.flags & PageFlags.WRITABLE)

class PageTable:
    """Simple single-level page table for demonstration."""

    def __init__(self, page_size: int = 4096):
        self.page_size = page_size
        self.entries: Dict[int, PageTableEntry] = {}
        self.tlb: Dict[int, int] = {}  # Virtual page -> Physical frame cache
        self.tlb_hits = 0
        self.tlb_misses = 0

    def map_page(self, virtual_page: int, physical_frame: int,
                 flags: PageFlags = PageFlags.PRESENT | PageFlags.WRITABLE) -> None:
        """Map a virtual page to a physical frame."""
        self.entries[virtual_page] = PageTableEntry(physical_frame, flags)
        # Invalidate TLB entry if exists
        if virtual_page in self.tlb:
            del self.tlb[virtual_page]

    def unmap_page(self, virtual_page: int) -> None:
        """Unmap a virtual page."""
        if virtual_page in self.entries:
            del self.entries[virtual_page]
        if virtual_page in self.tlb:
            del self.tlb[virtual_page]

    def translate(self, virtual_addr: int, write: bool = False) -> int:
        """Translate a virtual address to physical address."""
        virtual_page = virtual_addr // self.page_size
        offset = virtual_addr % self.page_size
        # Check TLB first
        if virtual_page in self.tlb:
            self.tlb_hits += 1
            physical_frame = self.tlb[virtual_page]
        else:
            self.tlb_misses += 1
            # Walk page table
            if virtual_page not in self.entries:
                raise PageFault(virtual_addr, "Page not mapped")
            entry = self.entries[virtual_page]
            if not entry.is_present():
                raise PageFault(virtual_addr, "Page not present")
            if write and not entry.is_writable():
                raise PageFault(virtual_addr, "Write to read-only page")
            # Update accessed/dirty bits
            entry.flags |= PageFlags.ACCESSED
            if write:
                entry.flags |= PageFlags.DIRTY
            physical_frame = entry.frame_number
            # Cache in TLB
            self.tlb[virtual_page] = physical_frame
        return physical_frame * self.page_size + offset

    def flush_tlb(self) -> None:
        """Flush entire TLB."""
        self.tlb.clear()

    def get_stats(self) -> dict:
        """Return TLB statistics."""
        total = self.tlb_hits + self.tlb_misses
        hit_rate = self.tlb_hits / total if total > 0 else 0
        return {
            "tlb_hits": self.tlb_hits,
            "tlb_misses": self.tlb_misses,
            "hit_rate": hit_rate
        }

class PageFault(Exception):
    def __init__(self, address: int, reason: str):
        self.address = address
        self.reason = reason
        super().__init__(f"Page fault at 0x{address:x}: {reason}")

class VirtualMemorySimulator:
    """Simulates a simple virtual memory system."""

    def __init__(self, physical_memory_size: int = 1024 * 1024):
        self.page_table = PageTable()
        self.physical_memory = bytearray(physical_memory_size)
        self.next_free_frame = 0
        self.page_size = 4096

    def allocate_frame(self) -> int:
        """Allocate a physical frame."""
        frame = self.next_free_frame
        self.next_free_frame += 1
        return frame

    def mmap(self, virtual_addr: int, size: int) -> None:
        """Map a region of virtual memory."""
        start_page = virtual_addr // self.page_size
        num_pages = (size + self.page_size - 1) // self.page_size
        for i in range(num_pages):
            frame = self.allocate_frame()
            self.page_table.map_page(start_page + i, frame)

    def read(self, virtual_addr: int) -> int:
        """Read a byte from virtual memory."""
        try:
            physical_addr = self.page_table.translate(virtual_addr)
            return self.physical_memory[physical_addr]
        except PageFault as e:
            # Handle page fault - in real OS, would allocate/swap in
            print(f"Page fault: {e}")
            raise

    def write(self, virtual_addr: int, value: int) -> None:
        """Write a byte to virtual memory."""
        try:
            physical_addr = self.page_table.translate(virtual_addr, write=True)
            self.physical_memory[physical_addr] = value
        except PageFault as e:
            print(f"Page fault: {e}")
            raise

# Example usage
if __name__ == "__main__":
    vm = VirtualMemorySimulator()
    # Map 16KB starting at virtual address 0x1000
    vm.mmap(0x1000, 16384)
    # Write and read some data
    vm.write(0x1000, 42)
    vm.write(0x1001, 43)
    print(f"Read 0x1000: {vm.read(0x1000)}")  # 42
    print(f"Read 0x1001: {vm.read(0x1001)}")  # 43
    # Check TLB stats
    print(vm.page_table.get_stats())
    # Access unmapped memory
    try:
        vm.read(0x9000)  # Not mapped
    except PageFault as e:
        print(f"Caught: {e}")
This simplified implementation demonstrates:
- Page table structure and lookup
- TLB caching for fast translation
- Page fault handling
- Address decomposition into page number and offset
Production implementations add multi-level tables, hardware acceleration, and integration with the physical memory allocator.
16. Summary
Virtual memory is one of the most elegant abstractions in computing:
Core concepts:
- Pages and frames divide memory into fixed-size chunks
- Page tables map virtual addresses to physical addresses
- Multi-level page tables handle sparse address spaces efficiently
- TLBs cache translations for fast access
Key mechanisms:
- Demand paging allocates memory lazily
- Copy-on-write enables efficient process forking
- Memory-mapped files provide zero-copy I/O
- Huge pages reduce TLB pressure for large allocations
Performance considerations:
- TLB misses can dominate memory-intensive workloads
- NUMA awareness is critical for multi-socket systems
- Page faults are expensive—minimize major faults
- Memory layout affects cache and TLB efficiency
Security foundations:
- Address space isolation protects processes from each other
- ASLR randomizes memory layout to hinder exploits
- W^X prevents code injection attacks
- Memory tagging catches memory safety bugs in hardware
Understanding virtual memory transforms how you think about system performance. Whether you’re debugging mysterious slowdowns, optimizing database buffer pools, or securing applications against memory attacks, this knowledge is foundational to effective systems programming.