Distributed-Systems
- Epidemic Protocols: Gossip, HyParView, Plumtree, and the Mathematics of Infection-Style Dissemination
· 2025-12-08
How push, push-pull, and pull gossip propagate information with tunable reliability guarantees — plus HyParView for membership and Plumtree for efficient broadcast in large-scale dynamic networks.
- Distributed Snapshots: The Chandy-Lamport Algorithm, Lai-Yang, and the Foundations of Consistent Global State
· 2025-10-31
How do you capture a consistent snapshot of a running distributed system without stopping the world? The Chandy-Lamport algorithm, its non-FIFO extension by Lai and Yang, and the deep connection to checkpointing and deadlock detection.
- Inside Vector Databases: Building Retrieval-Augmented Systems that Scale
· 2025-10-26
How modern vector databases ingest, index, and serve embeddings for production retrieval-augmented generation systems without falling over.
- Distributed Systems: Consensus, Consistency, and Fault Tolerance
· 2025-10-20
Fundamentals of distributed systems: failure models, consensus algorithms (Paxos, Raft), the CAP theorem, consistency models, gossip, membership, CRDTs, and practical testing with tools such as Jepsen.
- Clock Synchronization: Lamport Clocks, Vector Clocks, Hybrid Logical Clocks, and the CRDT Connection
· 2025-10-08
From scalar Lamport clocks that capture causality to vector clocks that characterize it precisely, through hybrid logical clocks that bridge physical and logical time — the intellectual lineage of distributed timekeeping.
- The 100‑Microsecond Rule: Why Tail Latency Eats Your Throughput (and How to Fight Back)
· 2025-10-04
A field guide to taming P99 in modern systems—from queueing math to NIC interrupts, from hedged requests to adaptive concurrency. Practical patterns, pitfalls, and a blueprint you can apply this week.
- Time in Distributed Systems: NTP, PTP, TrueTime, and the Impossibility of Perfect Synchronization
· 2025-10-01
From Marzullo's algorithm in NTP to hardware timestamping in PTP and Google's TrueTime in Spanner — how distributed systems wrestle with the fundamental impossibility of perfectly synchronized clocks.
- The Quiet Calculus of Probabilistic Commutativity
· 2025-09-27
A practical calculus for quantifying when non-commutative operations in distributed systems can be safely executed without heavyweight coordination.
- The Hidden Backbone of Parallelism: How Prefix Sums Power Distributed Computation
· 2025-09-21
Discover how the humble prefix sum (scan) quietly powers GPUs, distributed clusters, and big data frameworks—an obscure but essential building block of parallel and distributed computation.
- Queueing Theory for Systems Engineers: From M/M/1 to Heavy-Tail Distributions and Tail-at-Scale
· 2025-07-18
Master queueing theory as a practical tool for systems design: the M/M/1 model, Little's Law, Jackson networks, the dramatic impact of heavy-tailed service times on tail latency, and how to apply these insights to load balancers, microservices, and capacity planning.
- Error-Correcting Codes: Reed-Solomon, LDPC, and How Distributed Storage Survives Failure
· 2025-05-18
Build error-correcting codes from the ground up: finite field arithmetic, Reed-Solomon encoding and decoding via Lagrange interpolation, LDPC codes and belief propagation, and how modern distributed storage systems use erasure coding to survive disk failures with minimal overhead.
- Linearizability and Serializability: A Formal Hierarchy of Consistency Models
· 2025-01-28
Build a rigorous understanding of consistency models from linearizability to eventual consistency, with formal definitions, counterexamples, and the practical implications for distributed database design.
- The FLP Impossibility Result: Why Distributed Consensus Is Fundamentally Hard
· 2025-01-15
Explore the landmark Fischer-Lynch-Paterson result that proved no deterministic algorithm can achieve consensus in an asynchronous system with even one faulty process — and how the field evolved around this impossibility.
- TCP Congestion Control: From Slow Start to BBR
· 2023-02-11
A comprehensive exploration of TCP congestion control algorithms, from classic approaches like Tahoe and Reno to modern innovations like BBR. Learn how these algorithms balance throughput, fairness, and latency across diverse network conditions.
- Threshold Cryptography: Distributed Key Generation, Threshold ECDSA, and the Validator Use Case
· 2023-02-03
A rigorous look at threshold cryptography from Shamir secret sharing through GJKR distributed key generation to modern threshold ECDSA and BLS signatures for blockchain validators.
- Timeouts, Retries, and Idempotency Keys: A Practical Guide
· 2022-09-08
Make your distributed calls safe under partial failure. How to budget timeouts, avoid retry storms, and use idempotency keys without shooting yourself in the foot.
- Designing CRDT-Powered Collaboration Platforms that Stay Consistent
· 2022-08-17
A deep dive into how conflict-free replicated data types underpin real-time editors, whiteboards, and multiplayer apps without sacrificing UX.
- State Machine Replication: Viewstamped Replication Protocol, Zab (ZooKeeper Atomic Broadcast), and the Consensus-Scalability Continuum
· 2021-07-27
A deep exploration of state machine replication — how Viewstamped Replication and Zab enable fault-tolerant services through ordered command execution, and how the consensus-scalability continuum shapes modern distributed systems design.
- Streaming Systems: Apache Flink Checkpointing, Kafka Log Compaction, Watermarks and Event-Time Processing, and Exactly-Once Semantics
· 2021-07-22
A deep exploration of streaming systems — how Flink's distributed checkpointing provides exactly-once state consistency, how Kafka's log compaction enables durable event storage, and how watermarks solve the event-time vs processing-time dilemma.
- Object Storage: RADOS/Ceph Architecture, the CRUSH Placement Algorithm, S3 API Semantics, and Erasure Coding at Scale
· 2021-06-21
A deep exploration of object storage — how Ceph's RADOS and CRUSH algorithm enable scalable, self-managing storage clusters, the S3 API's influence on cloud storage, and how erasure coding reduces storage overhead.
- Distributed File Systems: GFS Design, HDFS Architecture, the Colossus Evolution, and Single-Master Metadata Bottlenecks
· 2021-06-18
A deep exploration of distributed file systems — how Google's GFS pioneered the single-master model, how HDFS adapted it for the Hadoop ecosystem, and how modern systems have evolved beyond the single-master bottleneck.
- Raft Fast‑Commit and PreVote in Practice
· 2020-11-09
What fast‑commit and PreVote actually change in Raft, how they affect availability during leader changes, and where the footguns are.
- Safe Rollback Strategies for Distributed Databases
· 2020-11-08
A comprehensive guide to designing, executing, and validating rollbacks in distributed database environments without compromising data integrity or customer trust.
- Consistent Hashing: Distributing Data Across Dynamic Clusters
· 2020-03-28
A deep dive into consistent hashing, the elegant algorithm that enables scalable distributed systems. Learn how it works, why it matters for databases and caches, and explore modern variations like jump consistent hashing and rendezvous hashing.