[Paper Notes] Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production
NSDI'25, by Alibaba
TL;DR: Aegis is Alibaba Cloud’s production fault-diagnosis stack for AI training. It evolved in two phases: (1) log-driven runtime triage with an offline backstop; (2) procedure-aware runtime localization by instrumenting the collective communication library (CCL). It also adds a pre-delivery check (CBD) and a degradation detector. In production: >97% less idle time wasted on diagnosis, −84% restarts, and −71% performance degradation.
1 Goal and scope
Diagnose the culprit at service runtime without touching customer code; isolate the bad device quickly so training can continue; root-cause deep dives can happen offline later. Focus on task failures and performance degradation.
2 Phase-1: Enhance existing systems (runtime + offline backstop)
- Inputs & alignment: time-align training/CCL logs, OS dmesg, NIC/driver logs, switch syslog/counters, and quick pingmesh snapshots per incident.
- Signal classes:
- CriticalError = hard faults (uncorrectable ECC, link down, device missing, fatal driver/thermal).
- DistError = cascade symptoms (e.g., NCCL aborts, “connection reset by peer”) that fan out across ranks.
- Decision rules (fast path; sketched after this list):
- Any CriticalError on a host ⇒ isolate that host, restart job.
- DistError on a single src–dst pair ⇒ quarantine both ends (cheap collateral to recover quickly).
- Many hosts implicated ⇒ RootDiag: cluster first-failure edges; if all touch the same GPU/host, blame it; else mark “network-suspect” and run NetDiag.
- NetDiag (minutes): reachability/latency pingmesh; on anomalies, in-band hop tracing to localize the faulty ToR/Agg/optic.
- Offline backstop (last resort): reproduce on small disjoint subsets to fence the error region; run targeted host checks (GPU/PCIe/NVLink/NIC) and variable-size probes when network-suspect.
- Effect: converts most job-wide crashes into single-node isolations with minimal blast radius; ~70% less idle time from diagnosis before Phase-2.
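A minimal sketch of the fast-path rules and the RootDiag escalation, assuming a hypothetical `Event` log record (the field names and the number of first-failure edges examined are illustrative, not Aegis's implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    ts: float                   # timestamp of the aligned log entry
    host: str                   # host that emitted it
    kind: str                   # "critical" (hard fault) or "dist" (cascade symptom)
    peer: Optional[str] = None  # remote endpoint, for DistError-style events

def triage(events: list[Event]) -> dict:
    """Hypothetical fast path: CriticalError > single-pair DistError > RootDiag."""
    critical = {e.host for e in events if e.kind == "critical"}
    if critical:
        # Rule 1: any hard fault => isolate that host and restart the job.
        return {"action": "isolate", "hosts": sorted(critical)}

    dist = sorted((e for e in events if e.kind == "dist"), key=lambda e: e.ts)
    pairs = {tuple(sorted((e.host, e.peer))) for e in dist if e.peer}
    if len(pairs) == 1:
        # Rule 2: one src-dst pair => quarantine both ends (cheap collateral).
        return {"action": "isolate", "hosts": list(pairs.pop())}

    # Rule 3 (RootDiag): cluster the earliest failure edges; if they all touch
    # the same host, blame it; otherwise mark network-suspect and run NetDiag.
    first_edges = [(e.host, e.peer) for e in dist[:8] if e.peer]
    common = set.intersection(*(set(edge) for edge in first_edges)) if first_edges else set()
    if len(common) == 1:
        return {"action": "isolate", "hosts": sorted(common)}
    return {"action": "netdiag", "suspects": sorted({h for edge in first_edges for h in edge})}
```

The ordering is the point: cheap, high-precision rules first, with NetDiag and the offline backstop only as escalation paths.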
3 Phase-2: Procedure-aware runtime diagnosis (custom CCL)
Design: Instrument the CCL (plugin to frameworks) to expose per-collective status at the compute/comm boundary—minimal customer impact and rich, timely signals.
Failure patterns (a sketch follows this list):
- Worker-side failure: one rank in a group stalls at a collective → pinpoint the guilty GPU group via collective-level counters.
- Communication failure: a mismatch between sent and consumed work-request counts identifies the bad source/destination; run NetDiag on the suspects.
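A hedged sketch of both checks, assuming the instrumented CCL exports per-rank, per-collective counters (the `RankState` fields such as `coll_seq`, `wr_sent`, and `wr_recv` are illustrative names, not the paper's API):

```python
from dataclasses import dataclass

@dataclass
class RankState:
    rank: int
    group: int               # GPU/communication group id
    coll_seq: int            # index of the last collective this rank completed
    wr_sent: dict[int, int]  # dst rank -> work-requests posted toward it
    wr_recv: dict[int, int]  # src rank -> work-requests consumed from it

def worker_side_stalls(states: list[RankState]) -> list[tuple[int, int]]:
    """Flag (group, rank) pairs that lag their group peers at a collective."""
    by_group: dict[int, list[RankState]] = {}
    for s in states:
        by_group.setdefault(s.group, []).append(s)
    suspects = []
    for group, members in by_group.items():
        front = max(m.coll_seq for m in members)
        suspects += [(group, m.rank) for m in members if m.coll_seq < front]
    return suspects

def comm_mismatches(states: list[RankState]) -> list[tuple[int, int]]:
    """Flag (src, dst) rank pairs whose sent vs. consumed counts disagree."""
    by_rank = {s.rank: s for s in states}
    bad = []
    for src in states:
        for dst, sent in src.wr_sent.items():
            if sent != by_rank[dst].wr_recv.get(src.rank, 0):
                bad.append((src.rank, dst))  # hosts behind these ranks go to NetDiag
    return bad
```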
Outcome: Runtime diagnosis rate rises from ~77% to nearly 100%, avoiding offline isolation most of the time.
4 Performance degradation diagnosis
- Basic correlation: Detect slowdowns with existing runtime metrics; pick device-revealing signals first.
- Procedure-aware enhancement: Use CCL-level skew tests; flag a GPU group G if its communication time exceeds 1.5× the typical group time, then attribute to source/destination with a root-diag pass (sketched below).
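A minimal sketch of the skew test under the 1.5× threshold; using the median over all groups as the baseline is my assumption, since only the multiplier is given above:

```python
from statistics import median

ALPHA = 1.5  # slowdown multiplier from the note above

def slow_groups(comm_time: dict[int, float]) -> list[int]:
    """Flag GPU groups whose communication time exceeds ALPHA x the median group time."""
    baseline = median(comm_time.values())
    return [g for g, t in comm_time.items() if t > ALPHA * baseline]

# Example: group 3 takes ~1.9x the typical time and gets flagged for a root-diag pass.
print(slow_groups({0: 10.2, 1: 9.8, 2: 10.5, 3: 19.0}))  # -> [3]
```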
5 CBD (Check-Before-Delivery)
- Why: 73% of failures happen during init; frequent updates and “post-usage” faults make hosts fail right after allocation.
- What: A parallel checklist (config + single-host + multi-host) that completes in ≲10 minutes; a 1-minute lightweight version exists for PaaS. It catches issues only visible once the full containerized environment is up (a runner sketch follows this list).
- Effect: 1–2% bad hosts intercepted pre-handoff; CBD made mandatory.
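A rough sketch of a parallel, time-budgeted check runner; the two commands are illustrative stand-ins for CBD's actual config/single-host/multi-host item list:

```python
import concurrent.futures
import subprocess

# Illustrative items only; the real checklist also covers config and
# multi-host checks and fits in ~10 minutes (1 minute for the PaaS variant).
CHECKS = {
    "gpu_visible": ["nvidia-smi", "-L"],
    "rdma_nic_up": ["ibstat"],
}

def run_check(name: str, cmd: list[str], timeout_s: int = 60) -> tuple[str, bool]:
    """Run one check command; a timeout or non-zero exit counts as a failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        return name, proc.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return name, False

def check_before_delivery() -> dict[str, bool]:
    """Run all checks in parallel; the host is delivered only if every item passes."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_check, name, cmd) for name, cmd in CHECKS.items()]
        return dict(f.result() for f in futures)

print(check_before_delivery())
```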
6 Techniques worth copying
- Use the CCL as the universal probe surface—portable across frameworks, minimal privacy/code-change concerns, yet precise enough for culprit isolation.
- Keep an offline backstop, but relentlessly raise the share of runtime-solved cases.
- Pre-flight, parallelized acceptance tests right before resource delivery; keep a 1-minute “fast path”.
7 One-liners to remember
Phase-1 = logs + rules + offline backstop. Phase-2 = CCL-aware, near-100% runtime isolation. CBD = short, parallel, pre-handoff sieve. Degradation = metric correlation first, then CCL-level skew to attribute. Net result = fast isolation, fewer restarts, steadier iterations.