Qianliang's blog

[Paper Notes] Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production

NSDI'25, by Alibaba

TL;DR: Aegis is Alibaba Cloud’s production fault-diagnosis stack for AI training. It evolved in two phases: (1) log-driven runtime triage with an offline backstop; (2) procedure-aware runtime localization by instrumenting the collective communication library (CCL). It also adds a pre-delivery check (CBD) and a degradation detector. In production: >97% less idle time wasted on diagnosis, −84% restarts, and −71% performance degradation.

1 Goal and scope

Diagnose the culprit at service runtime without touching customer code; isolate the bad device fast so training can continue; root–cause deep dives can be offline later. Focus on task failures and performance degradation.

2 Phase-1: Enhance existing systems (runtime + offline backstop)

3 Phase-2: Procedure-aware runtime diagnosis (custom CCL)

4 Performance degradation diagnosis

5 CBD (Check-Before-Delivery)

6 Techniques worth copying

7 One-liners to remember

Phase-1 = logs + rules + offline backstop. Phase-2 = CCL-aware, near-100% runtime isolation. CBD = short, parallel, pre-handoff sieve. Degradation = metric correlation first, then CCL-level skew to attribute. Net result = fast isolation, fewer restarts, steadier iterations.