Virtual Memory and Page Tables: How Operating Systems Manage Memory

A comprehensive exploration of virtual memory systems, page tables, address translation, and the hardware-software collaboration that enables modern multitasking. Understand TLBs, page faults, and memory protection.
Every process believes it has the entire machine to itself. It sees a vast, contiguous address space starting from zero, completely isolated from other processes. This illusion is virtual memory—one of the most important abstractions in computing. Understanding how operating systems and hardware collaborate to maintain this illusion reveals fundamental insights about performance, security, and system design.
1. The Need for Virtual Memory
Before virtual memory, programming was a constant juggling act.
1.1 Problems with Physical Addressing
Early systems used physical addresses directly:
Program A loads at address 0x1000:
┌──────────────────────────────────────────────┐
│ 0x0000 │ 0x1000 │ 0x2000 │ 0x3000 │
│ OS │Program A │ Free │ Free │
└──────────────────────────────────────────────┘
Problems:
1. Relocation: Programs must know their load address
- Code compiled for 0x1000 won't work at 0x2000
- Must recompile or use position-independent code
2. Protection: Nothing stops A from accessing OS memory
- Buggy program can crash entire system
- Malicious program can read other programs' data
3. Fragmentation: Memory becomes unusable swiss cheese
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ OS │Free│ A │Free│ B │Free│ C │Free│
└────┴────┴────┴────┴────┴────┴────┴────┘
Total free: 400KB, but largest contiguous: 100KB
4. Limited space: Programs limited to physical RAM
- 16MB RAM = 16MB maximum program size
- No way to run larger programs
1.2 Virtual Memory Goals
Virtual memory provides:
1. Isolation
Each process sees private address space
Process A's address 0x1000 ≠ Process B's 0x1000
2. Protection
Hardware enforces access permissions
Read-only code, no-execute data, kernel-only regions
3. Simplified programming
Every program links at same virtual address
Compiler doesn't need to know load location
4. Memory as abstraction
Virtual space can exceed physical RAM
OS pages data to disk transparently
5. Sharing
Multiple processes can map same physical page
Shared libraries loaded once, mapped many times
1.3 Address Space Layout
Typical 64-bit Linux process virtual address space:
0xFFFFFFFFFFFFFFFF ┌─────────────────────────────┐
│ Kernel Space │ ← Shared across all processes
0xFFFF800000000000 ├─────────────────────────────┤
│ (Unused/Guard) │
├─────────────────────────────┤
│ Stack │ ← Grows downward
│ ↓ │
├─────────────────────────────┤
│ Memory Mapped Region │ ← Libraries, mmap files
├─────────────────────────────┤
│ ↑ │
│ Heap │ ← Grows upward
├─────────────────────────────┤
│ BSS │ ← Uninitialized data
├─────────────────────────────┤
│ Data │ ← Initialized data
├─────────────────────────────┤
│ Text │ ← Program code
0x0000000000400000 ├─────────────────────────────┤
│ (Unmapped) │ ← Catch NULL derefs
0x0000000000000000 └─────────────────────────────┘
2. Pages and Frames
The fundamental unit of virtual memory is the page.
2.1 Dividing Memory into Pages
Virtual and physical memory divided into fixed-size blocks:
Virtual Address Space Physical Memory
(Pages) (Frames)
┌─────────────┐ Page 0 ┌─────────────┐ Frame 0
│ │ │ │
├─────────────┤ Page 1 ├─────────────┤ Frame 1
│ │ │ │
├─────────────┤ Page 2 ├─────────────┤ Frame 2
│ │ ──────────► │ │
├─────────────┤ Page 3 ├─────────────┤ Frame 3
│ │ │ │
├─────────────┤ Page 4 ├─────────────┤ Frame 4
│ │ ──────────► │ │
└─────────────┘ └─────────────┘
Page size is typically 4KB (4096 bytes)
Some systems support larger pages: 2MB, 1GB (huge pages)
2.2 Address Decomposition
A virtual address has two parts:
32-bit address with 4KB pages:
┌────────────────────┬────────────────────┐
│ Page Number │ Page Offset │
│ (20 bits) │ (12 bits) │
└────────────────────┴────────────────────┘
2^20 = 1M pages 2^12 = 4KB per page
Example: Virtual address 0x12345678
Binary: 0001 0010 0011 0100 0101 0110 0111 1000
Page Number: 0x12345 (top 20 bits)
Page Offset: 0x678 (bottom 12 bits)
Translation:
1. Look up page number 0x12345 in page table
2. Get physical frame number (e.g., 0xABCDE)
3. Physical address = (frame number << 12) | offset
0xABCDE << 12 | 0x678 = 0xABCDE678
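A minimal C sketch of this split-and-recombine arithmetic; the frame number 0xABCDE stands in for the page-table lookup result from the example above:
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t vaddr  = 0x12345678;
    uint32_t vpn    = vaddr >> 12;        /* page number: 0x12345 */
    uint32_t offset = vaddr & 0xFFF;      /* page offset: 0x678 */
    uint32_t frame  = 0xABCDE;            /* pretend page-table lookup result */
    uint32_t paddr  = (frame << 12) | offset;
    printf("vpn=0x%X offset=0x%X paddr=0x%X\n", vpn, offset, paddr);
    return 0;                             /* prints paddr=0xABCDE678 */
}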
2.3 Why Fixed-Size Pages?
Advantages of fixed-size pages:
1. Simple allocation
- Any free frame can satisfy any page
- No external fragmentation
- Bitmap or free list tracking
2. Efficient swapping
- Swap page-sized chunks to disk
- Predictable I/O sizes
3. Hardware simplicity
- Page table entry size is fixed
- Address translation is bit manipulation
Disadvantages:
1. Internal fragmentation
- 4097 bytes needs 2 pages (wastes 4095 bytes)
- Average waste: half a page per allocation
2. Page table size
- Must map entire address space
- 4KB pages in 48-bit space = huge tables
Trade-off: Larger pages reduce table size but increase fragmentation
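The internal-fragmentation arithmetic above is just ceiling division; a quick sketch:
#include <stdio.h>

int main(void) {
    size_t page = 4096, size = 4097;
    size_t pages = (size + page - 1) / page;   /* round up: 2 pages */
    printf("pages=%zu, wasted=%zu bytes\n", pages, pages * page - size);
    return 0;                                  /* pages=2, wasted=4095 bytes */
}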
3. Page Tables
The data structure that maps virtual to physical addresses.
3.1 Simple Flat Page Table
Conceptually, a page table is an array:
Page Table for Process A:
┌─────────┬─────────────┬───────────────────┐
│ Index │ Frame Number│ Flags │
├─────────┼─────────────┼───────────────────┤
│ 0 │ 0x00123 │ Present, RW │
│ 1 │ 0x00456 │ Present, RO │
│ 2 │ --- │ Not Present │
│ 3 │ 0x00789 │ Present, RW, User │
│ ... │ ... │ ... │
│ 1M-1 │ 0xFFFFF │ Present, RW │
└─────────┴─────────────┴───────────────────┘
Problem: For 32-bit address space with 4KB pages:
- 2^20 = 1,048,576 page table entries
- Each entry ~4 bytes
- Page table = 4MB per process!
For 64-bit with 48-bit virtual addresses:
- 2^36 entries = 68 billion entries
- Completely impractical
3.2 Multi-Level Page Tables
Solution: Hierarchical page tables
Only allocate table portions that are actually used
Two-Level Page Table (32-bit x86):
Virtual Address: 0x12345678
┌──────────┬──────────┬────────────┐
│ Dir (10) │Table (10)│Offset (12) │
└──────────┴──────────┴────────────┘
0x48 0x345 0x678
Page Directory Page Table Physical Memory
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Entry 0 │ │ │ │ │
├─────────────┤ ├─────────────┤ ├─────────────┤
│ ... │ │ │ │ │
├─────────────┤ ├─────────────┤ ├─────────────┤
│ Entry 0x48 │───────────►│ Entry 0x345 │────────►│ Frame │
├─────────────┤ ├─────────────┤ ├─────────────┤
│ ... │ │ │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
Benefits:
- Sparse address spaces need few page tables
- Unused regions don't need table entries
- Trade: Extra memory access per level
3.3 Four-Level Page Tables (x86-64)
Modern x86-64 uses 4-level paging (48-bit virtual addresses):
Virtual Address breakdown:
┌───────┬───────┬───────┬───────┬────────────┐
│PML4(9)│PDP(9) │PD(9) │PT(9) │Offset(12) │
└───────┴───────┴───────┴───────┴────────────┘
512 512 512 512 4096
Levels:
PML4 - Page Map Level 4 (512 entries)
└─► PDP - Page Directory Pointer (512 entries each)
└─► PD - Page Directory (512 entries each)
└─► PT - Page Table (512 entries each)
└─► 4KB Physical Page
Each entry is 8 bytes (64-bit pointers + flags)
Each table is 4KB (512 × 8 bytes = 4096)
Maximum addressable: 2^48 = 256 TB
Typical process uses tiny fraction of address space
3.4 Page Table Entry Format
x86-64 Page Table Entry (PTE):
  Bit 63    Bits 62-52    Bits 51-12    Bits 11-9   Bits 8-0
┌────────┬─────────────┬──────────────┬───────────┬──────────┐
│   NX   │  Reserved   │  Frame Addr  │   Avail   │  Flags   │
└────────┴─────────────┴──────────────┴───────────┴──────────┘
Key flags (bits 0-11):
Bit 0 (P): Present - page is in physical memory
Bit 1 (R/W): Read/Write - 0=read-only, 1=writable
Bit 2 (U/S): User/Supervisor - 0=kernel only, 1=user accessible
Bit 3 (PWT): Page Write-Through - caching policy
Bit 4 (PCD): Page Cache Disable
Bit 5 (A): Accessed - set by hardware on access
Bit 6 (D): Dirty - set by hardware on write
Bit 7 (PS): Page Size - 1=huge page (2MB/1GB)
Bit 63 (NX): No Execute - prevent code execution
Frame address: Physical frame number (bits 12-51)
4. Address Translation in Hardware
The CPU performs translation on every memory access.
4.1 Translation Process
CPU executes: mov eax, [0x12345678]
1. Extract page table indices from virtual address
PML4 index: bits 47-39 = 0
PDP index: bits 38-30 = 0
PD index: bits 29-21 = 0x91 (145)
PT index: bits 20-12 = 0x145 (325)
Offset: bits 11-0 = 0x678
2. Walk the page table hierarchy
CR3 register points to PML4 base address
PML4[0] → PDP base address
PDP[0] → PD base address
PD[0x91] → PT base address
PT[0x145] → Physical frame + flags
3. Check permissions
If not present → Page Fault
If user accessing kernel page → Page Fault
If writing read-only page → Page Fault
4. Compute physical address
Frame number from PTE + offset = physical address
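A small sketch that reproduces the index extraction from step 1; the values in the comments match the walk above:
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t va = 0x12345678;
    unsigned pml4 = (va >> 39) & 0x1FF;   /* 0     */
    unsigned pdp  = (va >> 30) & 0x1FF;   /* 0     */
    unsigned pd   = (va >> 21) & 0x1FF;   /* 0x91  */
    unsigned pt   = (va >> 12) & 0x1FF;   /* 0x145 */
    unsigned off  =  va        & 0xFFF;   /* 0x678 */
    printf("pml4=%u pdp=%u pd=0x%X pt=0x%X off=0x%X\n",
           pml4, pdp, pd, pt, off);
    return 0;
}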
4.2 Translation Lookaside Buffer (TLB)
Problem: Page table walk requires 4 memory accesses per translation
Solution: Cache recent translations in TLB
TLB: Hardware cache of page table entries
┌─────────────────┬────────────────┬─────────────┐
│ Virtual Page # │ Physical Frame │ Flags │
├─────────────────┼────────────────┼─────────────┤
│ 0x12345 │ 0xABCDE │ RW, User │
│ 0x00001 │ 0x00042 │ RO, User │
│ 0x7FFFF │ 0x12345 │ RW, Kernel │
│ ... │ ... │ ... │
└─────────────────┴────────────────┴─────────────┘
TLB characteristics:
- Fully associative or set-associative
- Typically 64-1024 entries
- Split I-TLB and D-TLB common
- Hit rate > 99% for most workloads
TLB hit: ~1 cycle (included in memory access)
TLB miss: ~10-100 cycles (page table walk)
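Back-of-envelope effective cost: with a 99% hit rate and, say, a 60-cycle walk on a miss, the average translation overhead is roughly 0.99 × 0 + 0.01 × 60 ≈ 0.6 cycles per access; at a 90% hit rate it jumps to ~6 cycles, which is why TLB coverage matters so much for large working sets.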
4.3 TLB Management
TLB must be kept consistent with page tables:
Context switch:
- New process has different page tables
- Old TLB entries are invalid
- Option 1: Flush entire TLB (expensive)
- Option 2: Tag entries with ASID (Address Space ID)
Page table updates:
- OS modifies page table entry
- Must invalidate corresponding TLB entry
- invlpg instruction on x86
TLB shootdown (multiprocessor):
1. CPU 0 modifies page table
2. CPU 0 invalidates local TLB entry
3. CPU 0 sends IPI to other CPUs
4. Other CPUs invalidate their TLB entries
5. CPU 0 waits for acknowledgment
Very expensive! Minimized by batching updates
4.4 Hardware Page Table Walker
Modern CPUs have dedicated page table walk hardware:
┌─────────────────────────────────────────────────────┐
│ CPU │
│ ┌──────────┐ ┌───────┐ ┌─────────────────┐ │
│ │ Core │───►│ TLB │───►│ Page Table │ │
│ │ │ │ │ │ Walker (PTW) │ │
│ └──────────┘ └───────┘ └─────────────────┘ │
│ │ │ │ │
│ │ TLB Hit TLB Miss │
│ ▼ │ │ │
│ ┌──────────┐ │ ▼ │
│ │ Memory │◄────────┴───────────────┘ │
│ │Controller│ │
│ └──────────┘ │
└─────────────────────────────────────────────────────┘
PTW features:
- Runs in parallel with CPU execution
- Multiple outstanding walks possible
- Caches intermediate page table entries
- Can prefetch based on access patterns
5. Page Faults
When translation fails, the OS takes over.
5.1 Types of Page Faults
Page fault occurs when:
1. Page not present (P bit = 0)
2. Permission violation (write to RO, user to kernel)
3. Reserved bit violation
Fault types by cause:
Minor fault (soft fault):
- Page is in memory but not mapped
- Just update page table, no I/O
- Example: Copy-on-write page accessed
Major fault (hard fault):
- Page must be read from disk
- Significant latency (milliseconds)
- Example: Swapped-out page accessed
Invalid fault:
- Access to truly invalid address
- Results in SIGSEGV (segmentation fault)
- Example: NULL pointer dereference
5.2 Demand Paging
Pages loaded only when accessed:
Program starts:
┌────────────────────────────────────────────────┐
│ Text │ Data │ BSS │ Heap/Stack │
└────────────────────────────────────────────────┘
All pages marked "not present" initially
First instruction fetch:
1. CPU tries to read from text segment
2. TLB miss, page table walk
3. Page not present → Page fault
4. OS loads page from executable file
5. Maps page, marks present
6. Returns to instruction, retry succeeds
Benefits:
- Fast program startup
- Only load pages actually used
- Many code paths never executed
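Demand paging is easy to observe from user space. A sketch using getrusage() minor-fault counts (error handling omitted):
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    size_t size = 64UL << 20;                 /* 64MB mapping, no frames yet */
    char* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    long before = minor_faults();
    memset(p, 1, size);                       /* first touch faults in each page */
    printf("faults from touching: %ld (~%zu pages)\n",
           minor_faults() - before, size / 4096);
    return 0;
}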
5.3 Copy-on-Write (COW)
Efficient process forking:
fork() without COW:
Parent: [Page A][Page B][Page C]
↓ copy all pages
Child: [Page A'][Page B'][Page C']
Problem: Expensive, child might exec() immediately
fork() with COW:
Parent: [Page A][Page B][Page C] ← Marked read-only
↘ ↓ ↙
Child: Shares same physical pages
When either process writes:
1. Page fault (writing to read-only page)
2. OS copies the page
3. Each process gets its own copy
4. Writing process page marked writable
Parent writes to Page B:
Parent: [Page A][Page B'][Page C] ← B' is new copy
Child: [Page A][Page B ][Page C] ← Still shares A and C
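These semantics are visible from user space. A minimal fork() sketch (error handling omitted): the child's write triggers a COW fault and gets a private copy, so the parent's view is unchanged.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    char* buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    strcpy(buf, "original");
    pid_t pid = fork();                     /* pages now shared, marked read-only */
    if (pid == 0) {
        strcpy(buf, "child-copy");          /* write → COW fault → private copy */
        printf("child sees:  %s\n", buf);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", buf);       /* still "original" */
    return 0;
}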
5.4 Page Fault Handler
// Simplified page fault handler logic (pseudocode: the helper
// functions are illustrative, not real kernel APIs)
void page_fault_handler(unsigned long fault_address, unsigned long error_code) {
    struct vm_area* vma = find_vma(current->mm, fault_address);
    struct page* page;

    if (vma == NULL) {
        // Address not in any mapped region
        send_signal(current, SIGSEGV);
        return;
    }
    if (!permissions_ok(vma, error_code)) {
        // Permission violation (e.g., write to a read-only mapping)
        send_signal(current, SIGSEGV);
        return;
    }
    if (is_cow_fault(vma, error_code)) {
        // Copy-on-write: give the writing process its own copy
        handle_cow(vma, fault_address);
        return;
    }
    if (is_file_backed(vma)) {
        // Memory-mapped file: read the backing page
        page = read_page_from_file(vma->file, offset_in_vma(vma, fault_address));
    } else if (is_swap_backed(vma)) {
        // Swapped-out page: the non-present PTE holds the swap entry
        page = read_page_from_swap(swap_entry_of(vma, fault_address));
    } else {
        // Anonymous page (heap/stack): hand out a zeroed frame
        page = allocate_zero_page();
    }
    // Install the translation and retry the faulting instruction
    map_page(current->mm, fault_address, page, vma->permissions);
}
6. Memory Protection
Virtual memory enables fine-grained access control.
6.1 Protection Bits
Each page has protection attributes:
Read (R): Can read from page
Write (W): Can write to page
Execute (X): Can execute code from page
Common combinations:
R--: Read-only data (constants, shared libraries)
RW-: Read-write data (heap, stack, globals)
R-X: Executable code (text segment)
RWX: Self-modifying code (JIT, avoid if possible)
User/Supervisor bit:
- U=1: User mode can access
- U=0: Kernel mode only
Protection prevents:
- Writing to code (code injection)
- Executing data (buffer overflow exploits)
- User accessing kernel memory
- Process accessing other process memory
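A short sketch of toggling these bits from user space with mprotect() (error handling omitted):
#include <string.h>
#include <sys/mman.h>

int main(void) {
    char* p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    strcpy(p, "hello");                     /* OK: page is RW- */
    mprotect(p, 4096, PROT_READ);           /* now R-- */
    /* p[0] = 'x';  <- would fault with SIGSEGV: write to read-only page */
    mprotect(p, 4096, PROT_READ | PROT_WRITE);
    p[0] = 'H';                             /* writable again */
    return 0;
}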
6.2 Address Space Layout Randomization (ASLR)
Randomize virtual address layout for security:
Without ASLR (predictable):
Stack: 0x7FFFFFFFE000
Heap: 0x00602000
libc: 0x7FFFF7A00000
Binary: 0x00400000
With ASLR (randomized each run):
Run 1:
Stack: 0x7FFC12345000
Heap: 0x55A432100000
libc: 0x7F8901234000
Run 2:
Stack: 0x7FFD98765000
Heap: 0x562B87600000
libc: 0x7FA456789000
Makes exploitation harder:
- Attacker can't predict where things are
- Return-to-libc attacks need address leak
- Stack buffer overflows harder to exploit
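You can see ASLR directly: the sketch below prints stack, heap, data, and code addresses. Run it twice (compiled as a position-independent executable, the default on most modern distros) and the values change.
#include <stdio.h>
#include <stdlib.h>

int global;                                 /* data segment */

int main(void) {
    int local;                              /* stack */
    void* heap = malloc(16);                /* heap */
    printf("stack: %p\nheap:  %p\ndata:  %p\ncode:  %p\n",
           (void*)&local, heap, (void*)&global, (void*)main);
    free(heap);
    return 0;
}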
6.3 Kernel Address Space Layout
Kernel/user separation:
Lower half (user): 0x0000000000000000 - 0x00007FFFFFFFFFFF
Upper half (kernel): 0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF
Canonical address gap:
- Addresses 0x0000800000000000 - 0xFFFF7FFFFFFFFFFF invalid
- Hardware checks bit 47 is sign-extended through bits 48-63
- Provides 128TB user + 128TB kernel
KPTI (Kernel Page Table Isolation):
- Meltdown mitigation
- User page tables don't map kernel
- Switch page tables on kernel entry/exit
- Performance cost ~5% on syscall-heavy workloads
7. Swapping and Paging to Disk
Virtual memory can exceed physical RAM.
7.1 Page Replacement
When physical memory is full:
1. Select victim page to evict
2. If dirty, write to swap
3. Update page table (mark not present)
4. Use freed frame for new page
Page replacement algorithms:
FIFO (First In First Out):
- Evict oldest page
- Simple but ignores usage patterns
- Suffers from Belady's anomaly
LRU (Least Recently Used):
- Evict page unused longest
- Good approximation of optimal
- Expensive to implement exactly
Clock (Second Chance):
- Circular list of pages
- Check accessed bit, give second chance
- Approximates LRU cheaply
┌─────────────────────────────────────┐
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ A │─►│ B │─►│ C │─►│ D │ │
│ │A=1│ │A=0│ │A=1│ │A=0│◄─┐ │
│ └───┘ └───┘ └───┘ └───┘ │ │
│ ▲ │ │
│ └─────────────────────────┘ │
│ Clock hand │
└─────────────────────────────────────┘
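A minimal sketch of the clock hand's decision loop (frame bookkeeping is assumed; real kernels track far more state):
#define NFRAMES 64

struct frame { int accessed; /* plus page identity, dirty bit, ... */ };
static struct frame frames[NFRAMES];
static int hand;

int clock_evict(void) {
    for (;;) {
        if (frames[hand].accessed) {
            frames[hand].accessed = 0;          /* second chance */
            hand = (hand + 1) % NFRAMES;
        } else {
            int victim = hand;                  /* untouched since last sweep */
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
    }
}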
7.2 Working Set Model
Working set: Pages actively used by process
Working Set Size over time:
│ ┌────────────┐
│ ┌──────────┐ │ │ ┌───────
│ │ │ │ │ │
│ │ └────┘ └────┘
└────┴─────────────────────────────────────────►
Phase 1 Transition Phase 2 Phase 3
Thrashing:
- Working set > available memory
- Constant page faults
- Process makes no progress
Detection:
- High page fault rate
- Low CPU utilization despite load
- Excessive disk I/O
Solutions:
- Reduce degree of multiprogramming
- Add more RAM
- Kill memory-hungry processes
7.3 Swap Space Management
Swap partition/file organization:
┌─────────────────────────────────────────────────────┐
│ Swap Space │
├─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┤
│ P1 │Free │ P2 │ P1 │Free │ P3 │ P2 │Free │ P1 │
│pg 5 │ │pg 2 │pg 8 │ │pg 1 │pg 9 │ │pg 3 │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
Swap entry in page table:
When page is swapped out, PTE contains:
- Present bit = 0
- Swap device/file identifier
- Offset within swap space
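A hypothetical encoding sketch (field widths are assumptions; real layouts are architecture- and kernel-specific):
#include <stdint.h>

#define SWP_TYPE_BITS 5   /* assumed: supports up to 32 swap areas */

/* Bit 0 (Present) stays 0, so the MMU faults and the OS decodes the rest. */
static inline uint64_t make_swap_pte(unsigned type, uint64_t offset) {
    return (offset << (SWP_TYPE_BITS + 1)) | ((uint64_t)type << 1);
}
static inline unsigned swp_type(uint64_t pte) {
    return (pte >> 1) & ((1u << SWP_TYPE_BITS) - 1);
}
static inline uint64_t swp_offset(uint64_t pte) {
    return pte >> (SWP_TYPE_BITS + 1);
}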
Linux swap organization:
- Swap areas (partitions or files)
- Priority ordering (faster swap first)
- Swap clusters for sequential I/O
- Frontswap/zswap for compressed swap pages in RAM
7.4 Memory Pressure Handling
Linux memory management:
Free memory watermarks:
┌──────────────────────────────────────────────────┐
│ │
│ High watermark ───────────────────────────── │
│ (comfortable, no action needed) │
│ │
│ Low watermark ───────────────────────────── │
│ (kswapd wakes up, background reclaim) │
│ │
│ Min watermark ───────────────────────────── │
│ (direct reclaim, allocations block) │
│ │
│ Out of memory ───────────────────────────── │
│ (OOM killer invoked) │
└──────────────────────────────────────────────────┘
Reclaim targets:
1. Page cache (clean file pages)
2. Dirty file pages (write back first)
3. Anonymous pages (swap out)
4. Slab caches (kernel allocations)
8. Memory-Mapped Files
Mapping files directly into address space.
8.1 mmap() System Call
// Map file into memory
int fd = open("data.bin", O_RDWR);
struct stat st;
fstat(fd, &st);
void* addr = mmap(
NULL, // Let kernel choose address
st.st_size, // Map entire file
PROT_READ | PROT_WRITE, // Read and write access
MAP_SHARED, // Changes visible to other processes
fd, // File descriptor
0 // Offset in file
);
// Now access file like memory
char* data = (char*)addr;
data[0] = 'H'; // Writes to file (eventually)
// Unmap when done
munmap(addr, st.st_size);
close(fd);
8.2 Private vs Shared Mappings
MAP_SHARED:
- Changes written back to file
- Changes visible to other processes
- Used for: IPC, shared databases
Process A: [Page]──┐
├──► Physical Frame ◄──► File on disk
Process B: [Page]──┘
MAP_PRIVATE:
- Changes are copy-on-write
- Changes NOT written to file
- Used for: Loading executables, private copies
Process A: [Page]──┐
├──► Physical Frame (COW)
Process B: [Page]──┘
│
▼ (after write)
[Page A']──► Different Frame (private copy)
8.3 Memory-Mapped I/O Benefits
Traditional read() vs mmap():
read() approach:
1. System call overhead on every read
2. Copy from kernel buffer to user buffer
3. Kernel readahead favors sequential access
mmap() approach:
1. One-time setup cost
2. Zero-copy access (pages mapped straight from the page cache)
3. Random access efficient
4. Automatic caching via page cache
When to use mmap:
✓ Large files with random access
✓ Shared memory between processes
✓ Memory-mapping hardware devices
✓ Efficient file-backed data structures
When to use read/write:
✓ Sequential access patterns
✓ Small files
✓ Portability concerns
✓ Fine-grained error handling needed
8.4 Anonymous Mappings
Memory not backed by any file:
// Allocate 1GB of anonymous memory
void* mem = mmap(
NULL,
1UL << 30, // 1GB
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS,
-1, // No file
0
);
// Memory is zero-initialized (lazily)
// Pages allocated on first access
Uses:
- Large heap allocations (malloc uses for big allocs)
- Stack growth
- JIT compilation buffers
Backed by:
- Zero page initially (read)
- Anonymous frames on write
- Swap space if swapped out
9. Huge Pages
Larger pages for better performance.
9.1 TLB Pressure Problem
Standard 4KB pages:
- 1GB of memory = 262,144 pages
- TLB might hold 1024 entries
- TLB covers only 4MB
- High miss rate for large data
Huge pages (2MB):
- 1GB = 512 huge pages
- Same TLB covers 1GB
- Dramatically fewer misses
Huge pages (1GB):
- 1GB = 1 page
- Single TLB entry covers all
- Best for truly huge allocations
9.2 Transparent Huge Pages (THP)
Linux can automatically use huge pages:
Configuration:
/sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
always: System tries to use huge pages everywhere
madvise: Only where application requests
never: Disabled
Benefits:
- No application changes needed
- Reduced TLB pressure
- Less page table overhead
Drawbacks:
- Memory fragmentation can prevent huge pages
- Compaction overhead (khugepaged)
- Memory waste (internal fragmentation)
- Latency spikes during promotion/demotion
9.3 Explicit Huge Pages
// Using hugetlbfs
#include <sys/mman.h>
// Allocate 2MB huge page
void* huge = mmap(
NULL,
2 * 1024 * 1024,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
-1,
0
);
// Or using madvise
void* regular = mmap(NULL, size, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
madvise(regular, size, MADV_HUGEPAGE);
// Database use case: pre-allocate huge pages at boot
// Reserve: echo 1024 > /proc/sys/vm/nr_hugepages
// Mount: mount -t hugetlbfs none /mnt/huge
// Application maps files from /mnt/huge
9.4 Huge Page Trade-offs
Advantages:
+ Fewer TLB entries needed
+ Smaller page tables
+ Faster page table walks
+ Better for large contiguous data
Disadvantages:
- Memory fragmentation (need contiguous 2MB/1GB)
- Internal fragmentation (waste for small allocs)
- Longer page fault handling
- Copy-on-write copies more data
- Swapping granularity larger
Best for:
- Databases (large buffer pools)
- Scientific computing (large arrays)
- Virtual machines (guest RAM)
- In-memory caches
Avoid for:
- Small allocations
- Short-lived processes
- Memory-constrained systems
10. NUMA and Memory Locality
Non-Uniform Memory Access in multi-socket systems.
10.1 NUMA Architecture
Uniform Memory Access (UMA):
┌─────────┐ ┌─────────┐
│ CPU 0 │ │ CPU 1 │
└────┬────┘ └────┬────┘
│ │
└─────┬──────┘
│
┌─────┴─────┐
│ Memory │ ← Same latency from both CPUs
└───────────┘
Non-Uniform Memory Access (NUMA):
┌─────────┐ ┌─────────┐
│ CPU 0 │ │ CPU 1 │
└────┬────┘ └────┬────┘
│ │
┌────┴────┐ QPI/UPI ┌────┴────┐
│ Memory 0│◄─────────►│ Memory 1│
└─────────┘ └─────────┘
(local) (remote) (local)
~70ns ~120ns ~70ns
Local access: Fast
Remote access: Slower (cross-socket interconnect)
10.2 NUMA-Aware Allocation
// Linux NUMA API
#include <numa.h>
#include <numaif.h>
// Check NUMA availability
if (numa_available() == -1) {
// NUMA not available
}
// Allocate on specific node
void* local = numa_alloc_onnode(size, 0); // Node 0
void* remote = numa_alloc_onnode(size, 1); // Node 1
// Allocate interleaved across nodes
void* interleaved = numa_alloc_interleaved(size);
// Bind memory policy
unsigned long nodemask = 1; // Node 0 only
set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask)*8);
// Migrate a process's pages between nodes (from_nodes/to_nodes are
// struct bitmask* built with numa_allocate_nodemask())
numa_migrate_pages(pid, from_nodes, to_nodes);
10.3 First-Touch Policy
Default Linux policy: First touch
Page allocated on node where first accessed:
// Thread on Node 0 allocates
char* data = malloc(1UL << 30);    // 1GB: no physical pages yet
// Thread on Node 1 first touches
memset(data, 0, 1UL << 30);        // Pages allocated on Node 1!
// Thread on Node 0 accesses → remote!
Problem for parallel initialization:
// Main thread allocates
data = malloc(large_size);
memset(data, 0, large_size); // All on main thread's node
// Worker threads access → all remote!
Solution: Parallel first touch
#pragma omp parallel for
for (size_t i = 0; i < size; i += PAGE_SIZE) {
data[i] = 0; // Each thread touches its portion
}
10.4 NUMA Balancing
Automatic NUMA balancing (Linux):
1. Periodically scan process memory
2. Identify pages accessed from wrong node
3. Migrate pages closer to accessing CPU
Implementation:
- Kernel periodically unmaps (protects) pages
- Page fault reveals accessing CPU
- Migration if remote access detected
Enable/disable:
echo 1 > /proc/sys/kernel/numa_balancing
Trade-offs:
+ Adapts to changing access patterns
+ No application changes needed
- CPU overhead for scanning
- Migration overhead
- May fight with application's own policy
11. Kernel Virtual Memory
How the kernel manages its own address space.
11.1 Kernel Address Space Layout
Linux x86-64 kernel memory layout:
0xFFFFFFFFFFFFFFFF ┌──────────────────────────────┐
│ Fixed mappings │ ← APIC, etc.
0xFFFFFFFFFE000000 ├──────────────────────────────┤
│ Modules │ ← Loadable modules
0xFFFFFFFFC0000000 ├──────────────────────────────┤
│ vmemmap │ ← Page descriptors
0xFFFFEA0000000000 ├──────────────────────────────┤
│ vmalloc space │ ← Non-contiguous allocs
0xFFFFC90000000000 ├──────────────────────────────┤
│ Direct mapping │ ← All physical RAM
0xFFFF880000000000 ├──────────────────────────────┤
│ (guard hole) │
0xFFFF800000000000 └──────────────────────────────┘
11.2 Direct Mapping
All physical RAM mapped linearly:
Physical: 0x00000000 0x00001000 0x00002000 ...
│ │ │
Virtual: 0xFFFF880000000000 ...
│ │ │
page_offset + phys = virt
Benefits:
- Simple physical ↔ virtual conversion
- All kernel data accessible without mapping
- Page tables themselves in direct map
Conversion macros:
__pa(virt) → physical address
__va(phys) → virtual address
phys_to_virt(phys) → virtual address
virt_to_phys(virt) → physical address
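In spirit, the conversion is one addition or subtraction. A simplified sketch assuming the classic non-KASLR x86-64 direct-map base (the real kernel macros handle more cases):
#define PAGE_OFFSET 0xFFFF880000000000UL   /* direct-map base (assumed, pre-KASLR) */

#define __va(phys) ((void*)((unsigned long)(phys) + PAGE_OFFSET))
#define __pa(virt) ((unsigned long)(virt) - PAGE_OFFSET)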
11.3 vmalloc Area
For large, non-contiguous kernel allocations:
kmalloc: Physically contiguous
vmalloc: Virtually contiguous, physically fragmented
Physical Memory: vmalloc Virtual Space:
┌───┐ ┌───────────────────┐
│ A │ │ ┌───┬───┬───┐ │
├───┤ │ │ A │ B │ C │ │
│///│ (used) │ └───┴───┴───┘ │
├───┤ │ Contiguous │
│ B │ └───────────────────┘
├───┤
│///│
├───┤
│ C │
└───┘
Use cases:
- Loading kernel modules
- Large buffers where contiguity not needed
- When physical memory is fragmented
Cost:
- Requires page table entries
- TLB pressure
- Slightly slower access than kmalloc
11.4 Kernel Memory Allocation
// Kernel allocation functions
// Small, physically contiguous
void* p = kmalloc(size, GFP_KERNEL);
kfree(p);
// Page-aligned, physically contiguous
struct page* page = alloc_pages(GFP_KERNEL, order);
void* addr = page_address(page);
__free_pages(page, order);
// Virtually contiguous (may be physically scattered)
void* v = vmalloc(large_size);
vfree(v);
// DMA-capable (specific physical constraints)
void* dma = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
// Slab allocator (object caching)
struct kmem_cache* cache = kmem_cache_create("my_objects",
sizeof(struct my_object), 0, 0, NULL);
struct my_object* obj = kmem_cache_alloc(cache, GFP_KERNEL);
kmem_cache_free(cache, obj);
12. Virtual Memory in Virtualization
Additional translation layers for virtual machines.
12.1 Shadow Page Tables
First-generation virtualization:
Guest virtual → Guest physical → Host physical
(Guest OS) (VMM)
Shadow page tables:
- VMM maintains shadow copies of guest page tables
- Shadow maps: Guest virtual → Host physical directly
- Guest page table changes trapped and synchronized
Guest Page Table: Shadow Page Table:
GVA → GPA GVA → HPA
┌─────┬─────┐ ┌─────┬─────┐
│ 0x1 │ 0xA │ │ 0x1 │ 0x50│
│ 0x2 │ 0xB │ ──────► │ 0x2 │ 0x51│
│ 0x3 │ 0xC │ │ 0x3 │ 0x52│
└─────┴─────┘ └─────┴─────┘
GPA → HPA mapping:
0xA → 0x50
0xB → 0x51
0xC → 0x52
12.2 Hardware-Assisted (Nested) Paging
Modern CPUs: EPT (Intel) / NPT (AMD)
Two levels of translation in hardware:
Guest Virtual Address (GVA)
│
▼ Guest page tables
Guest Physical Address (GPA)
│
▼ Extended/Nested page tables
Host Physical Address (HPA)
Benefits:
- No shadow page table maintenance
- Guest can modify its page tables freely
- Fewer VM exits
Costs:
- More levels to walk (up to 24 memory accesses!)
- TLB entries tagged with VM identifiers (Intel VPID / AMD ASID)
- Still expensive on TLB miss
12.3 Memory Overcommitment
Giving VMs more memory than physically available:
Host has 64GB RAM
VM1: 48GB allocated
VM2: 48GB allocated
VM3: 48GB allocated
Total: 144GB > 64GB physical
Techniques:
1. Ballooning
- Balloon driver in guest "inflates"
- Guest OS pages out its own memory
- Host reclaims balloon pages
2. Page deduplication (KSM)
- Scan for identical pages across VMs
- Map to single physical page (COW)
- Common OS pages shared
3. Swap to host
- VMM pages out entire guest pages
- Guest unaware
- Poor performance if thrashing
4. Memory compression
- Compress cold pages in memory
- Faster than disk, saves space
13. Performance Considerations
Optimizing for virtual memory behavior.
13.1 TLB Optimization
Maximize TLB coverage:
1. Use huge pages for large data
Regular: 4KB × 1024 TLB entries = 4MB coverage
Huge: 2MB × 1024 TLB entries = 2GB coverage
2. Improve locality
- Access memory sequentially when possible
- Keep working set in as few pages as possible
- Avoid pointer chasing across many pages
3. Reduce context switches
- Each switch may flush TLB (without PCID)
- Batch work to reduce switches
4. Pin critical data
- mlock() to prevent swapping
- Ensures TLB entries remain valid
13.2 Page Fault Optimization
Minimize page faults:
1. Prefetch data
- madvise(MADV_WILLNEED) hints to kernel
- readahead() for file-backed mappings
2. Lock pages for real-time
- mlockall(MCL_CURRENT | MCL_FUTURE)
- Prevents any page-out, no major faults
3. Pre-touch memory
- Access all pages after mmap
- Takes faults upfront, not during critical path
4. Use MAP_POPULATE
- Pre-fault all pages at mmap time
- Slower setup, no faults later
// Pre-population example
void* mem = mmap(NULL, size,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
-1, 0);
13.3 NUMA Optimization
For NUMA systems:
1. Measure first
numastat # System-wide stats
numastat -p pid # Per-process stats
2. Bind processes to nodes
numactl --cpunodebind=0 --membind=0 ./app
3. Interleave for bandwidth
numactl --interleave=all ./bandwidth_heavy_app
4. Application-level awareness
- Query topology: numa_num_configured_nodes()
- Allocate per-thread data on local node
- Avoid false sharing across nodes
5. Monitor migrations
grep numa /proc/vmstat
numa_hit, numa_miss, numa_foreign
13.4 Memory Bandwidth
Bandwidth bottlenecks:
Modern CPUs:
- Cache bandwidth: 100+ GB/s
- Memory bandwidth: 20-50 GB/s per channel
- Many-core chips can easily saturate memory
Optimization strategies:
1. Cache blocking / tiling
Process data in cache-sized chunks
2. Non-temporal stores
Bypass cache for write-only data
_mm_stream_si128() intrinsics
3. Memory-bound parallelism
More threads don't help beyond bandwidth limit
May hurt due to cache thrashing
4. Prefetching
Hide memory latency with lookahead
Hardware prefetch + software hints
14. Debugging Virtual Memory Issues
Tools and techniques for memory problems.
14.1 Process Memory Inspection
# Memory maps
cat /proc/PID/maps
7f8a12340000-7f8a12540000 r-xp 00000000 08:01 123456 /lib/libc.so.6
7f8a12540000-7f8a12740000 ---p 00200000 08:01 123456 /lib/libc.so.6
7f8a12740000-7f8a12744000 r--p 00200000 08:01 123456 /lib/libc.so.6
# Detailed memory stats
cat /proc/PID/status | grep -i mem
VmPeak: 1234 kB # Peak virtual memory
VmSize: 1200 kB # Current virtual memory
VmRSS: 800 kB # Resident set size
VmSwap: 100 kB # Swapped out memory
# Per-mapping details
cat /proc/PID/smaps
# Shows RSS, PSS, swap per mapping
# Free-list / fragmentation breakdown (system-wide)
cat /proc/pagetypeinfo
14.2 System Memory Analysis
# Overall memory
free -h
total used free shared buff/cache available
Mem: 15Gi 8.0Gi 2.0Gi 500Mi 5.5Gi 6.5Gi
# Detailed breakdown
cat /proc/meminfo
MemTotal: 16384000 kB
MemFree: 2048000 kB
MemAvailable: 6656000 kB
Buffers: 512000 kB
Cached: 5120000 kB
SwapTotal: 8192000 kB
SwapFree: 7168000 kB
Dirty: 12000 kB
AnonPages: 5000000 kB
Mapped: 1000000 kB
Shmem: 500000 kB
PageTables: 50000 kB
HugePages_Total: 0
HugePages_Free: 0
14.3 Page Table Analysis
# Page table overhead
grep PageTables /proc/meminfo
PageTables: 50000 kB
# Per-process page table size
cat /proc/PID/status | grep VmPTE
VmPTE: 5000 kB
# TLB statistics (requires perf)
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./app
# Example output:
# 1,000,000,000 dTLB-loads
# 1,000,000 dTLB-load-misses # 0.1% miss rate
# 500,000,000 iTLB-loads
# 100,000 iTLB-load-misses # 0.02% miss rate
14.4 Common Problems and Solutions
High page fault rate:
- Check if swapping: vmstat 1
- Pre-touch memory: memset after mmap
- Use huge pages for large allocations
- Increase memory or reduce working set
TLB thrashing:
- Use huge pages
- Improve memory locality
- Reduce process count (fewer TLB flushes)
- Check for excessive mmap/munmap
NUMA imbalance:
- numastat shows hits vs misses
- Check thread-to-memory binding
- Consider interleaving for bandwidth workloads
Page table bloat:
- Large sparse address spaces waste page tables
- Consider madvise(MADV_DONTNEED) for unused regions
- Compact allocations when possible
OOM kills:
- Review overcommit settings
- Add swap space
- Set oom_score_adj for important processes
- Use cgroups memory limits
15. Advanced Topics
Cutting-edge virtual memory techniques.
15.1 Memory Tagging
ARM Memory Tagging Extension (MTE):
Each 16-byte granule has 4-bit tag:
┌────────────────────────────────────────────────┐
│ Pointer: 0x1234_5678_9ABC_DEF0 │
│ Tag: ────────────────────0x5 │
│ │
│ Memory at 0x...9ABC_DEF0 has tag 0x5 │
│ Access with tag 0x5: OK │
│ Access with tag 0x3: Hardware exception! │
└────────────────────────────────────────────────┘
Use cases:
- Use-after-free detection
- Buffer overflow detection
- Memory safety without full bounds checking
Hardware support:
- Tags stored in memory (extra bits)
- Checked on every access
- Minimal performance overhead
15.2 Persistent Memory
Non-Volatile Memory (NVM):
byte-addressable persistent storage:
- Survives power loss like disk
- Accessed like memory (load/store)
- Latency ~100-300ns (between DRAM and SSD)
Programming model:
DAX (Direct Access) - bypass page cache
mmap() directly to NVM
Stores persist only once flushed from the CPU caches
Challenges:
- Cache flush ordering
- Atomic update guarantees
- Recovery after crash
// Persistent store pattern (pseudocode; on x86 the flush and fence
// map to the _mm_clwb() and _mm_sfence() intrinsics)
store(data, address);
clwb(address);   // Cache line write-back: push the line toward NVM
sfence();        // Store fence: order it before subsequent stores
15.3 Heterogeneous Memory
Systems with multiple memory types:
Example: DRAM + NVM + HBM
- DRAM: Fast, expensive, volatile
- NVM: Slower, cheaper, persistent
- HBM: Fastest, very expensive
Tiered memory:
Hot data → Fast tier (DRAM/HBM)
Cold data → Slow tier (NVM)
Linux support:
- Memory tiering (kernel 5.14+)
- Automatic page migration
- NUMA-like node representation
Intel Optane / CXL memory:
- Attached via CXL interconnect
- Latency higher than local DRAM
- Capacity expansion use case
15.4 Memory Disaggregation
Future: Memory as network resource
Traditional:
┌─────────────────────┐
│ Server 1 │
│ CPU ←──► Memory │
└─────────────────────┘
Disaggregated:
┌─────────────────────┐ ┌───────────────┐
│ Compute Node │ ◄────► │ Memory Pool │
│ CPU only │ RDMA │ (shared) │
└─────────────────────┘ └───────────────┘
┌─────────────────────┐ ▲
│ Compute Node │ ◄───────────┘
│ CPU only │
└─────────────────────┘
Benefits:
- Independent scaling of compute/memory
- Better utilization (pool shared memory)
- Failure isolation
Challenges:
- Network latency in critical path
- Complex consistency models
- New programming models needed
16. Summary and Best Practices
Key takeaways for working with virtual memory.
16.1 Core Concepts Review
Virtual memory provides:
✓ Isolation between processes
✓ Protection (R/W/X permissions)
✓ Abstraction (address space > physical RAM)
✓ Sharing (libraries, copy-on-write)
Key mechanisms:
- Page tables map virtual → physical
- TLB caches translations
- Page faults handle on-demand loading
- Swap extends memory to disk
Performance factors:
- TLB coverage and hit rate
- Page fault frequency
- NUMA locality
- Cache behavior
16.2 Practical Guidelines
For application developers:
1. Understand your allocator
- Large allocations use mmap
- Small allocations from heap
- Consider jemalloc/tcmalloc for heavy allocation
2. Use huge pages for large data
- madvise(MADV_HUGEPAGE)
- Or MAP_HUGETLB explicitly
3. Consider NUMA on multi-socket
- First-touch placement matters
- Profile with numastat
4. Avoid excessive virtual memory
- Each mmap has overhead
- Don't map huge sparse ranges
5. Lock memory for latency-critical paths
- mlockall() or mlock()
- Prevents page faults in hot paths
For system administrators:
1. Monitor memory pressure
- Watch for swap usage
- Check for OOM events
2. Tune overcommit policy
- /proc/sys/vm/overcommit_memory
3. Configure huge pages appropriately
- Reserve at boot for guaranteed availability
4. Balance swappiness
- /proc/sys/vm/swappiness
- Lower for latency, higher for throughput
16.3 Debugging Checklist
When investigating memory issues:
□ Check overall memory usage (free, /proc/meminfo)
□ Examine process memory (pmap, /proc/PID/smaps)
□ Look for memory leaks (valgrind, AddressSanitizer)
□ Check page fault rates (perf stat)
□ Examine TLB behavior (perf stat TLB events)
□ Review NUMA placement (numastat)
□ Check for swap activity (vmstat, sar)
□ Look for OOM events (dmesg)
□ Verify memory limits (cgroups, ulimit)
Virtual memory is the foundation upon which modern operating systems build process isolation, memory protection, and the illusion of infinite memory. The collaboration between hardware page table walkers, TLBs, and operating system page fault handlers creates a seamless abstraction that programmers often take for granted. Yet understanding these mechanisms deeply enables you to write more efficient code, debug mysterious performance problems, and make informed architectural decisions. Whether you’re optimizing a database buffer pool, debugging a memory leak, or designing a new system, the principles of virtual memory inform every aspect of how programs interact with the machine’s most fundamental resource.