Reliability
- Keeping the Model Awake: Building a Self-Healing ML Inference Platform
· 2023-02-14
A field report on taming production machine learning inference with proactive healing, adaptive scaling, and human empathy.
- Timeouts, Retries, and Idempotency Keys: A Practical Guide
· 2022-09-08
Make your distributed calls safe under partial failure. How to budget timeouts, avoid retry storms, and use idempotency keys without shooting yourself in the foot.
- Stochastic Processes for Computer Science: Poisson, Brownian Motion, Queueing and Reliability
· 2022-02-12
A rigorous treatment of continuous-time stochastic processes (Poisson processes, continuous-time Markov chains, and Brownian motion with the reflection principle) and their applications in queueing theory, reliability engineering, and network performance.