[Paper Notes] Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production
NSDI'25, by Alibaba
TL;DR: Aegis is Alibaba Cloud’s production fault-diagnosis stack for AI training. It evolved in two phases: (1) log-driven runtime triage with an offline backstop; (2) procedure-aware runtime localization by instrumenting the collective communication library (CCL). It also adds a pre-delivery check (CBD) and a degradation detector. In production: >97% less idle time wasted on diagnosis, −84% restarts, and −71% performance degradation.
1 Goal and scope
Diagnose the culprit at service runtime without touching customer code; isolate the bad device quickly so training can continue; root-cause deep dives can happen offline later. Focus on task failures and performance degradation.
2 Phase-1: Enhance existing systems (runtime + offline backstop)
- Inputs & alignment: time-align training/CCL logs, OS dmesg, NIC/driver logs, switch syslog/counters, and quick pingmesh snapshots per incident.
- Signal classes:
- CriticalError = hard faults (uncorrectable ECC, link down, device missing, fatal driver/thermal).
- DistError = cascade symptoms (e.g., NCCL aborts, “connection reset by peer”) that fan out across ranks.
- Decision rules (fast path; sketched after this list):
- Any CriticalError on a host ⇒ isolate that host, restart job.
- DistError on a single src–dst pair ⇒ quarantine both ends (cheap collateral to recover quickly).
- Many hosts implicated ⇒ RootDiag: cluster first-failure edges; if all touch the same GPU/host, blame it; else mark “network-suspect” and run NetDiag.
- NetDiag (minutes): reachability/latency pingmesh; on anomalies, in-band hop tracing to localize the faulty ToR/Agg/optic.
- Offline backstop (last resort): reproduce on small disjoint subsets to fence the error region; run targeted host checks (GPU/PCIe/NVLink/NIC) and variable-size probes when network-suspect.
- Effect: converts most job-wide crashes into single-node isolations with minimal blast radius; ~70% less idle time from diagnosis before Phase-2.
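A minimal sketch of the fast-path rules and the RootDiag escalation, assuming a hypothetical `Event` log record (the field names and the number of first-failure edges examined are illustrative, not Aegis's implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    ts: float                   # timestamp of the aligned log entry
    host: str                   # host that emitted it
    kind: str                   # "critical" (hard fault) or "dist" (cascade symptom)
    peer: Optional[str] = None  # remote endpoint, for DistError-style events

def triage(events: list[Event]) -> dict:
    """Hypothetical fast path: CriticalError > single-pair DistError > RootDiag."""
    critical = {e.host for e in events if e.kind == "critical"}
    if critical:
        # Rule 1: any hard fault => isolate that host and restart the job.
        return {"action": "isolate", "hosts": sorted(critical)}

    dist = sorted((e for e in events if e.kind == "dist"), key=lambda e: e.ts)
    pairs = {tuple(sorted((e.host, e.peer))) for e in dist if e.peer}
    if len(pairs) == 1:
        # Rule 2: one src-dst pair => quarantine both ends (cheap collateral).
        return {"action": "isolate", "hosts": list(pairs.pop())}

    # Rule 3 (RootDiag): cluster the earliest failure edges; if they all touch
    # the same host, blame it; otherwise mark network-suspect and run NetDiag.
    first_edges = [(e.host, e.peer) for e in dist[:8] if e.peer]
    common = set.intersection(*(set(edge) for edge in first_edges)) if first_edges else set()
    if len(common) == 1:
        return {"action": "isolate", "hosts": sorted(common)}
    return {"action": "netdiag", "suspects": sorted({h for edge in first_edges for h in edge})}
```

The ordering is the point: cheap, high-precision rules first, with NetDiag and the offline backstop only as escalation paths.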
3 Phase-2: Procedure-aware runtime diagnosis (custom CCL)
Design: Instrument the CCL (plugin to frameworks) to expose per-collective status at the compute/comm boundary—minimal customer impact and rich, timely signals.
Failure patterns (a sketch follows this list):
- Worker-side failure: one rank in a group stalls at a collective → pinpoint the guilty GPU group via collective-level counters.
- Communication failure: a mismatch between sent and consumed work-request counts identifies the bad source/destination; run NetDiag on the suspects.
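A hedged sketch of both checks, assuming the instrumented CCL exports per-rank, per-collective counters (the `RankState` fields such as `coll_seq`, `wr_sent`, and `wr_recv` are illustrative names, not the paper's API):

```python
from dataclasses import dataclass

@dataclass
class RankState:
    rank: int
    group: int               # GPU/communication group id
    coll_seq: int            # index of the last collective this rank completed
    wr_sent: dict[int, int]  # dst rank -> work-requests posted toward it
    wr_recv: dict[int, int]  # src rank -> work-requests consumed from it

def worker_side_stalls(states: list[RankState]) -> list[tuple[int, int]]:
    """Flag (group, rank) pairs that lag their group peers at a collective."""
    by_group: dict[int, list[RankState]] = {}
    for s in states:
        by_group.setdefault(s.group, []).append(s)
    suspects = []
    for group, members in by_group.items():
        front = max(m.coll_seq for m in members)
        suspects += [(group, m.rank) for m in members if m.coll_seq < front]
    return suspects

def comm_mismatches(states: list[RankState]) -> list[tuple[int, int]]:
    """Flag (src, dst) rank pairs whose sent vs. consumed counts disagree."""
    by_rank = {s.rank: s for s in states}
    bad = []
    for src in states:
        for dst, sent in src.wr_sent.items():
            if sent != by_rank[dst].wr_recv.get(src.rank, 0):
                bad.append((src.rank, dst))  # hosts behind these ranks go to NetDiag
    return bad
```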
Outcome: Runtime diagnosis rate rises from ~77% to nearly 100%, avoiding offline isolation most of the time.
4 Performance degradation diagnosis
- Basic correlation: Detect slowdowns with existing runtime metrics; pick device-revealing signals first.
- Procedure-aware enhancement: Use CCL-level skew tests; flag a GPU group G if its communication time exceeds 1.5× the typical group time, then attribute to source/destination with a root-diag pass (sketched below).
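A minimal sketch of the skew test under the 1.5× threshold; using the median over all groups as the baseline is my assumption, since only the multiplier is given above:

```python
from statistics import median

ALPHA = 1.5  # slowdown multiplier from the note above

def slow_groups(comm_time: dict[int, float]) -> list[int]:
    """Flag GPU groups whose communication time exceeds ALPHA x the median group time."""
    baseline = median(comm_time.values())
    return [g for g, t in comm_time.items() if t > ALPHA * baseline]

# Example: group 3 takes ~1.9x the typical time and gets flagged for a root-diag pass.
print(slow_groups({0: 10.2, 1: 9.8, 2: 10.5, 3: 19.0}))  # -> [3]
```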
5 CBD (Check-Before-Delivery)
- Why: 73% of failures happen during init; frequent updates and “post-usage” faults make hosts fail right after allocation.
- What: A parallel checklist (config + single-host + multi-host) that completes in ≲10 minutes; a 1-minute lightweight version exists for PaaS. It catches issues only visible once the full containerized environment is up (a runner sketch follows this list).
- Effect: 1–2% bad hosts intercepted pre-handoff; CBD made mandatory.
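A rough sketch of a parallel, time-budgeted check runner; the two commands are illustrative stand-ins for CBD's actual config/single-host/multi-host item list:

```python
import concurrent.futures
import subprocess

# Illustrative items only; the real checklist also covers config and
# multi-host checks and fits in ~10 minutes (1 minute for the PaaS variant).
CHECKS = {
    "gpu_visible": ["nvidia-smi", "-L"],
    "rdma_nic_up": ["ibstat"],
}

def run_check(name: str, cmd: list[str], timeout_s: int = 60) -> tuple[str, bool]:
    """Run one check command; a timeout or non-zero exit counts as a failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        return name, proc.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return name, False

def check_before_delivery() -> dict[str, bool]:
    """Run all checks in parallel; the host is delivered only if every item passes."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_check, name, cmd) for name, cmd in CHECKS.items()]
        return dict(f.result() for f in futures)

print(check_before_delivery())
```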
6 Techniques worth copying
- Use the CCL as the universal probe surface—portable across frameworks, minimal privacy/code-change concerns, yet precise enough for culprit isolation.
- Keep an offline backstop, but relentlessly raise the share of runtime-solved cases.
- Pre-flight, parallelized acceptance tests right before resource delivery; keep a 1-minute “fast path”.
7 One-liners to remember
Phase-1 = logs + rules + offline backstop. Phase-2 = CCL-aware, near-100% runtime isolation. CBD = short, parallel, pre-handoff sieve. Degradation = metric correlation first, then CCL-level skew to attribute. Net result = fast isolation, fewer restarts, steadier iterations.