- [Paper Notes] Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents
- [Paper Notes] Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production
- [Paper Notes] Alibaba HPN: A Data Center Network for Large Language Model Training
- [Paper Notes] RDMA over Ethernet for Distributed AI Training at Meta Scale
- [Paper Notes] Orion: Google’s Software-Defined Networking Control Plane
- [Paper Notes] Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization
- [Book Notes] Site Reliability Engineering: How Google Runs Production Systems
- [Paper Notes] Google Borg (Predecessor of Kubernetes)
- Learning Transformer and KV Cache as an AI NewBie
- [Paper Notes] Some Classic Distributed Systems