[Paper Notes] Alibaba HPN: A Data Center Network for Large Language Model Training
SIGCOMM’24, by Alibaba
TL;DR: Alibaba Cloud designed HPN, a specialized two-tier dual-plane Ethernet-based network tailored specifically for large-scale LLM training. HPN effectively resolves challenges posed by bursty, low-entropy traffic and sensitivity to single-point failures. Key innovations include a non-stacked dual-ToR architecture, dual-plane aggregation to eliminate hash polarization, rail-optimized Tier1 segments with latest-generation switches, and extensive operational improvements. Deployed in production, HPN enhanced LLM training throughput by ~15%.
1 Problem / Motivation
Traditional data-center networks struggle to support Large Language Model (LLM) training due to:
- Bursty traffic: Periodic, short-lived bursts that can saturate 400Gbps NICs during gradient synchronization, causing uneven load distribution.
- Low entropy: Only a few large (elephant) flows, which ECMP hashing balances poorly (see the sketch after this list).
- Sensitivity to single-point failures: Any failure, especially at the ToR level, halts synchronized GPU training jobs, resulting in high recovery costs.
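To make the low-entropy problem concrete, here is a minimal Python sketch (hypothetical 5-tuples and uplink count, with CRC32 standing in for a switch's ECMP hash): a handful of elephant flows hash onto only a few uplinks, so some links congest while most sit idle.

```python
import zlib
from collections import Counter

NUM_UPLINKS = 16  # hypothetical ECMP fan-out

def ecmp_pick(five_tuple, num_links):
    # Stand-in for a switch's ECMP hash: any deterministic hash of the 5-tuple.
    return zlib.crc32(repr(five_tuple).encode()) % num_links

# 8 hypothetical elephant flows, e.g. one long-lived 200Gbps flow per NIC
# during gradient synchronization.
flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 6, 49152 + i, 4791) for i in range(8)]

load = Counter(ecmp_pick(f, NUM_UPLINKS) for f in flows)
print("flows per uplink:", dict(load))
print("%d of %d uplinks carry all traffic; the rest sit idle" % (len(load), NUM_UPLINKS))
```

With thousands of small flows the collisions average out; with a few elephant flows they do not, which is exactly the low-entropy regime LLM training produces.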
2 Key Ideas & Design Innovations
Architectural Overview
- Two-tier dual-plane architecture (vs traditional 3-tier Clos).
- Each Pod interconnects 15K GPUs (1024 GPUs per segment, 15 segments in total).
- Dual-plane design eliminates hash polarization (traffic from a NIC's two ports stays isolated on separate planes) and significantly reduces the ECMP search space; a sketch of the polarization effect follows this list.
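A minimal sketch of the polarization effect, assuming both tiers apply the same hash to the flow 5-tuple (illustrative numbers, not Alibaba's actual hash function): flows that a Tier1 switch sends to aggregation switch j all satisfy hash(flow) % N == j, so an identical hash at the next tier can only ever pick uplink j.

```python
import zlib
from collections import defaultdict

N = 8  # hypothetical ECMP fan-out at both tiers

def h(flow):
    return zlib.crc32(repr(flow).encode())

flows = [("10.0.%d.%d" % (a, b), "10.1.%d.%d" % (b, a), 6, 50000 + a, 4791)
         for a in range(8) for b in range(8)]

uplinks_used = defaultdict(set)
for f in flows:
    agg = h(f) % N  # Tier1's ECMP choice of aggregation switch
    up = h(f) % N   # Tier2 reuses the same hash -> fully correlated choice
    uplinks_used[agg].add(up)

for agg in sorted(uplinks_used):
    print("agg switch %d uses %d of %d uplinks" % (agg, len(uplinks_used[agg]), N))
```

Every aggregation switch ends up using a single uplink. Keeping the two planes independent removes this correlated second decision, and the smaller per-plane ECMP search space is what makes precise path selection tractable later on.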
Dual-ToR Design (Non-stacked vs. Stacked)
- Traditional stacked dual-ToR has reliability risks (state-sync failures, upgrade incompatibilities).
- Non-stacked Dual-ToR:
- Removes the direct synchronization link between the two ToRs.
- The two ToRs stay consistent without coordinating directly, relying on modified LACP on the host side and ARP-learned host routes announced via BGP (see the sketch below).
- Ensures no single-point failures at ToR, significantly improving reliability.
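A toy event model of the failover path (hypothetical classes; per the notes, the real mechanism is modified LACP on the host side plus ARP-learned host routes announced over BGP, with no ToR-to-ToR link):

```python
class ToR:
    """One ToR; learns host routes via ARP and would announce them via BGP."""
    def __init__(self, name):
        self.name = name
        self.host_routes = set()

    def learn(self, host_ip):
        self.host_routes.add(host_ip)

    def withdraw(self, host_ip):
        self.host_routes.discard(host_ip)

class DualHomedNIC:
    """One backend NIC: port 0 -> tor_a, port 1 -> tor_b, no shared ToR state."""
    def __init__(self, host_ip, tor_a, tor_b):
        self.host_ip = host_ip
        self.links = {tor_a: True, tor_b: True}  # LACP member state per port
        for tor in (tor_a, tor_b):
            tor.learn(host_ip)

    def link_down(self, tor):
        self.links[tor] = False      # LACP detects the dead member ...
        tor.withdraw(self.host_ip)   # ... and that ToR withdraws its host route

    def reachable_via(self):
        return [t.name for t, up in self.links.items() if up]

a, b = ToR("ToR-A"), ToR("ToR-B")
nic = DualHomedNIC("10.0.0.1", a, b)
print("before failure:", nic.reachable_via())  # ['ToR-A', 'ToR-B']
nic.link_down(a)                               # NIC-to-ToR-A link fails or flaps
print("after failure: ", nic.reachable_via())  # ['ToR-B'] -> training continues
```

Because neither ToR holds state the other depends on, a ToR failure (or a buggy upgrade on one of them) degrades bandwidth instead of killing the job.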
Rail-Optimized Network
- Each host has 8 GPUs and 8 backend NICs (2×200Gbps ports each).
- GPUs/NICs are split across 8 rails (GPU i attaches to rail i), exploiting the much larger intra-host NVLink bandwidth for cross-rail traffic.
- Minimizes inter-host traffic, maximizing GPU utilization.
- Cross-rail communication between GPUs on different hosts takes two steps (sketched below): 1) hop via intra-host NVLink to the local GPU on the destination's rail; 2) then GPU -> NIC -> ToR -> target NIC -> target GPU within that rail's network.
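A toy helper making the two-step rule explicit (hypothetical function; assumes GPU i attaches to rail i, 8 GPUs per host):

```python
GPUS_PER_HOST = 8  # GPU i on a host attaches to rail i (assumption from the notes)

def route(src_host, src_gpu, dst_host, dst_gpu):
    """Return the hop sequence from (src_host, src_gpu) to (dst_host, dst_gpu)."""
    if src_host == dst_host:
        return ["NVLink"]  # intra-host traffic never touches the network
    dst_rail = dst_gpu % GPUS_PER_HOST
    hops = []
    if src_gpu % GPUS_PER_HOST != dst_rail:
        # Cross-rail: first shuffle over NVLink to the local GPU on the
        # destination's rail, so the network hop stays inside one rail.
        hops.append("NVLink to local GPU on rail %d" % dst_rail)
    hops += ["NIC", "rail-%d ToR" % dst_rail, "dst NIC", "dst GPU"]
    return hops

print(route(0, 0, 0, 5))  # ['NVLink']
print(route(0, 0, 1, 0))  # same rail, different host: a single network hop
print(route(0, 0, 1, 5))  # cross-rail and cross-host: NVLink first, then rail 5
```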
3 Architecture Deep Dive
Tier1 (Segment-level)
- Single-chip 51.2Tbps switch provides (128+8)×200Gbps downstream ports and 60×400Gbps upstream ports per ToR.
- Employs backup ports to enhance host availability.
- For each host: 8 GPUs × 2 ports each in the dual-ToR design -> 8 rails × 2 ToRs = 16 ToRs per segment (arithmetic checked in the sketch after this list).
- Customized cooling solution developed (optimized vapor chamber) to address overheating issues from increased chip power consumption.
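The Tier1 numbers above can be sanity-checked with quick arithmetic (derived only from figures in these notes):

```python
import math

down_gbps = (128 + 8) * 200           # host-facing ports, incl. 8 backups
up_gbps = 60 * 400                    # uplinks toward the aggregation tier
assert down_gbps + up_gbps == 51_200  # exactly fills one 51.2Tbps chip

g = math.gcd(128 * 200, up_gbps)      # oversubscription w/o the backup ports
print("down:up = %d:%d" % (128 * 200 // g, up_gbps // g))  # -> 16:15, near 1:1

hosts_per_segment = 128               # each ToR gives one 200G port per host
print("GPUs per segment:", hosts_per_segment * 8)        # -> 1024
print("GPUs per Pod:", hosts_per_segment * 8 * 15)       # -> 15360, i.e. ~15K
```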
Tier2 (Pod-level)
- Dual-plane aggregation eliminates hash polarization, reducing load imbalance by ~90%.
- Each Pod contains 15K GPUs (dual-plane doubles capacity).
- Optimized path selection via precise ECMP hashing and disjoint-path-aware load balancing in the collective communication library (see the sketch below).
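A sketch of what "disjoint-path aware" load balancing can look like at the collective-library level (hypothetical interface, not NCCL's or Alibaba's actual API): instead of trusting the ECMP hash, spread a job's connections over explicitly distinct uplinks so elephant flows never collide.

```python
from itertools import cycle

def assign_paths(connections, uplinks):
    """Round-robin connections onto uplinks; paths stay disjoint while uplinks last."""
    rr = cycle(uplinks)
    return {conn: next(rr) for conn in connections}

conns = ["qp-%d" % i for i in range(8)]  # e.g. one RDMA QP per NIC
print(assign_paths(conns, uplinks=list(range(8))))
# Every connection lands on a different uplink: perfectly even load,
# regardless of how low the traffic's entropy is.
```

The dual-plane design keeps the candidate uplink set small enough for this kind of explicit, per-connection placement to be practical.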
Tier3 (Cross-Pod-level)
- High oversubscription (15:1), optimized for Pipeline Parallelism (PP) traffic, which has lower bandwidth requirements.
- Dual-plane concept maintained at Core-level to prevent polarization issues for PP traffic across pods.
4 Frontend Network (Separated from Backend)
- Dedicated frontend NIC per host (2×200Gbps).
- Handles storage, management, and inference traffic separately, with full bisection bandwidth (1:1).
5 Operational Results
- In production for 8 months with no single-point-of-failure incidents at the ToR level.
- Observed performance gains:
- Overall training throughput: 14.9% improvement.
- Cross-segment traffic reduced by 37%.
- Collective communication improved by up to 59% (AllReduce) and up to 158% (Multi-AllReduce).
6 Reliability Evaluation
- Non-stacked dual-ToR ensures near-instant recovery from NIC-ToR link failures or flapping, compared to significant outages in single-ToR setups.
- Effective at avoiding catastrophic training interruptions.
7 Operational Experience & Lessons Learned
- One Pod per data-center building aligns with Alibaba's existing infrastructure (18MW buildings, ~15K GPUs each).
- Use of multimode optical transceivers reduces intra-building networking costs by 70%.
- A rail-only Tier2 was rejected as too inflexible for future model architectures (e.g., MoE requires cross-rail communication).
- HPN's complex wiring is verified using INT-based (In-band Network Telemetry) probes.
8 Future-Proofing & Scalability
- A next-generation design for upcoming 102.4Tbps switch chips is already being planned.
- Scalability to 100K GPUs is supported through careful Core-level and cross-Pod optimizations.
Comments
The paper proposes dual-ToR to remove the single point of failure at the ToR, with the non-stacked design further improving reliability. The dual-plane architecture mitigates hash polarization, addressing the low-entropy challenge. Bursty traffic is absorbed through more even load balancing, aided by the rail-optimized network (isolating traffic of GPUs on different rails), dual-plane aggregation (isolating the traffic of a NIC's two ports), and precise path selection.