GPU

The Hidden Backbone of Parallelism: How Prefix Sums Power Distributed Computation · 2025-09-21
Discover how the humble prefix sum (scan) quietly powers GPUs, distributed clusters, and big data frameworks—an obscure but essential building block of parallel and distributed computation.
GPUDirect Storage in 2025: Optimizing the End-to-End Data Path · 2025-09-16
How modern systems move data from NVMe and object storage into GPU kernels with minimal CPU overhead and maximal throughput.
Tuning CUDA with the GPU Memory Hierarchy · 2024-11-27
Global, shared, and register memory each have distinct latency and bandwidth. Performance comes from the right access pattern.
Cache‑Friendly Data Layouts: AoS vs. SoA (and the Hybrid In‑Between) · 2021-03-18
How memory layout choices shape the performance of your hot loops. A practical guide to arrays‑of‑structs, struct‑of‑arrays, and hybrid layouts across CPUs and GPUs.