[Paper Notes] Alibaba HPN: A Data Center Network for Large Language Model Training
SIGCOMM’24, by Alibaba
TL;DR: Alibaba Cloud designed HPN, a specialized two-tier dual-plane Ethernet-based network tailored specifically for large-scale LLM training. HPN effectively resolves challenges posed by bursty, low-entropy traffic and sensitivity to single-point failures. Key innovations include a non-stacked dual-ToR architecture, dual-plane aggregation to eliminate hash polarization, rail-optimized Tier1 segments with latest-generation switches, and extensive operational improvements. Deployed in production, HPN enhanced LLM training throughput by ~15%.
1 Problem / Motivation
Traditional data-center networks struggle to support Large Language Model (LLM) training due to:
- Bursty traffic: Periodic, short-lived bursts that can saturate 400Gbps NICs during gradient synchronization, causing uneven load distribution.
- Low entropy: Only a few large (elephant) flows, which ECMP hashing balances poorly (see the sketch after this list).
- Sensitivity to single-point failures: Any failure, especially at the ToR level, halts synchronized GPU training jobs, resulting in high recovery costs.
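To make the low-entropy problem concrete, here is a minimal Python sketch (hypothetical 5-tuples and uplink count, with CRC32 standing in for a switch's ECMP hash): a handful of elephant flows hash onto only a few uplinks, so some links congest while most sit idle.

```python
import zlib
from collections import Counter

NUM_UPLINKS = 16  # hypothetical ECMP fan-out

def ecmp_pick(five_tuple, num_links):
    # Stand-in for a switch's ECMP hash: any deterministic hash of the 5-tuple.
    return zlib.crc32(repr(five_tuple).encode()) % num_links

# 8 hypothetical elephant flows, e.g. one long-lived 200Gbps flow per NIC
# during gradient synchronization.
flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 6, 49152 + i, 4791) for i in range(8)]

load = Counter(ecmp_pick(f, NUM_UPLINKS) for f in flows)
print("flows per uplink:", dict(load))
print("%d of %d uplinks carry all traffic; the rest sit idle" % (len(load), NUM_UPLINKS))
```

With thousands of small flows the collisions average out; with a few elephant flows they do not, which is exactly the low-entropy regime LLM training produces.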
2 Key Ideas & Design Innovations
Architectural Overview
- Two-tier dual-plane architecture (vs traditional 3-tier Clos).
- Each Pod interconnects 15K GPUs (1024 GPUs per segment, 15 segments in total).
- Dual-plane design eliminates hash polarization (traffic from a NIC's two ports stays isolated on separate planes) and significantly reduces the ECMP search space; a sketch of the polarization effect follows this list.
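A minimal sketch of the polarization effect, assuming both tiers apply the same hash to the flow 5-tuple (illustrative numbers, not Alibaba's actual hash function): flows that a Tier1 switch sends to aggregation switch j all satisfy hash(flow) % N == j, so an identical hash at the next tier can only ever pick uplink j.

```python
import zlib
from collections import defaultdict

N = 8  # hypothetical ECMP fan-out at both tiers

def h(flow):
    return zlib.crc32(repr(flow).encode())

flows = [("10.0.%d.%d" % (a, b), "10.1.%d.%d" % (b, a), 6, 50000 + a, 4791)
         for a in range(8) for b in range(8)]

uplinks_used = defaultdict(set)
for f in flows:
    agg = h(f) % N  # Tier1's ECMP choice of aggregation switch
    up = h(f) % N   # Tier2 reuses the same hash -> fully correlated choice
    uplinks_used[agg].add(up)

for agg in sorted(uplinks_used):
    print("agg switch %d uses %d of %d uplinks" % (agg, len(uplinks_used[agg]), N))
```

Every aggregation switch ends up using a single uplink. Keeping the two planes independent removes this correlated second decision, and the smaller per-plane ECMP search space is what makes precise path selection tractable later on.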
Dual-ToR Design (Non-stacked vs. Stacked)
- Traditional stacked dual-ToR has reliability risks (state-sync failures, upgrade incompatibilities).
- Non-stacked Dual-ToR:
- Removes the direct synchronization link between the two ToRs.
- The two ToRs stay consistent without coordinating directly, relying on modified LACP on the host side and ARP-learned host routes announced via BGP (see the sketch below).
- Ensures no single-point failures at ToR, significantly improving reliability.
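A toy event model of the failover path (hypothetical classes; per the notes, the real mechanism is modified LACP on the host side plus ARP-learned host routes announced over BGP, with no ToR-to-ToR link):

```python
class ToR:
    """One ToR; learns host routes via ARP and would announce them via BGP."""
    def __init__(self, name):
        self.name = name
        self.host_routes = set()

    def learn(self, host_ip):
        self.host_routes.add(host_ip)

    def withdraw(self, host_ip):
        self.host_routes.discard(host_ip)

class DualHomedNIC:
    """One backend NIC: port 0 -> tor_a, port 1 -> tor_b, no shared ToR state."""
    def __init__(self, host_ip, tor_a, tor_b):
        self.host_ip = host_ip
        self.links = {tor_a: True, tor_b: True}  # LACP member state per port
        for tor in (tor_a, tor_b):
            tor.learn(host_ip)

    def link_down(self, tor):
        self.links[tor] = False      # LACP detects the dead member ...
        tor.withdraw(self.host_ip)   # ... and that ToR withdraws its host route

    def reachable_via(self):
        return [t.name for t, up in self.links.items() if up]

a, b = ToR("ToR-A"), ToR("ToR-B")
nic = DualHomedNIC("10.0.0.1", a, b)
print("before failure:", nic.reachable_via())  # ['ToR-A', 'ToR-B']
nic.link_down(a)                               # NIC-to-ToR-A link fails or flaps
print("after failure: ", nic.reachable_via())  # ['ToR-B'] -> training continues
```

Because neither ToR holds state the other depends on, a ToR failure (or a buggy upgrade on one of them) degrades bandwidth instead of killing the job.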
Rail-Optimized Network
- Each host has 8 GPUs and 8 backend NICs (2×200Gbps ports each).
- GPUs/NICs are split across 8 rails (GPU i attaches to rail i), exploiting the much larger intra-host NVLink bandwidth for cross-rail traffic.
- Minimizes inter-host traffic, maximizing GPU utilization.
- Cross-rail communication between GPUs on different hosts takes two steps (sketched below): 1) hop via intra-host NVLink to the local GPU on the destination's rail; 2) then GPU -> NIC -> ToR -> target NIC -> target GPU within that rail's network.
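A toy helper making the two-step rule explicit (hypothetical function; assumes GPU i attaches to rail i, 8 GPUs per host):

```python
GPUS_PER_HOST = 8  # GPU i on a host attaches to rail i (assumption from the notes)

def route(src_host, src_gpu, dst_host, dst_gpu):
    """Return the hop sequence from (src_host, src_gpu) to (dst_host, dst_gpu)."""
    if src_host == dst_host:
        return ["NVLink"]  # intra-host traffic never touches the network
    dst_rail = dst_gpu % GPUS_PER_HOST
    hops = []
    if src_gpu % GPUS_PER_HOST != dst_rail:
        # Cross-rail: first shuffle over NVLink to the local GPU on the
        # destination's rail, so the network hop stays inside one rail.
        hops.append("NVLink to local GPU on rail %d" % dst_rail)
    hops += ["NIC", "rail-%d ToR" % dst_rail, "dst NIC", "dst GPU"]
    return hops

print(route(0, 0, 0, 5))  # ['NVLink']
print(route(0, 0, 1, 0))  # same rail, different host: a single network hop
print(route(0, 0, 1, 5))  # cross-rail and cross-host: NVLink first, then rail 5
```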
3 Architecture Deep Dive
Tier1 (Segment-level)
- Single-chip 51.2Tbps switch provides (128+8)×200Gbps downstream ports and 60×400Gbps upstream ports per ToR.
- Employs backup ports to enhance host availability.
- For each host: 8 GPUs × 2 ports each in the dual-ToR design -> 8 rails × 2 ToRs = 16 ToRs per segment (arithmetic checked in the sketch after this list).
- Customized cooling solution developed (optimized vapor chamber) to address overheating issues from increased chip power consumption.
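The Tier1 numbers above can be sanity-checked with quick arithmetic (derived only from figures in these notes):

```python
import math

down_gbps = (128 + 8) * 200           # host-facing ports, incl. 8 backups
up_gbps = 60 * 400                    # uplinks toward the aggregation tier
assert down_gbps + up_gbps == 51_200  # exactly fills one 51.2Tbps chip

g = math.gcd(128 * 200, up_gbps)      # oversubscription w/o the backup ports
print("down:up = %d:%d" % (128 * 200 // g, up_gbps // g))  # -> 16:15, near 1:1

hosts_per_segment = 128               # each ToR gives one 200G port per host
print("GPUs per segment:", hosts_per_segment * 8)        # -> 1024
print("GPUs per Pod:", hosts_per_segment * 8 * 15)       # -> 15360, i.e. ~15K
```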
Tier2 (Pod-level)
- Dual-plane aggregation eliminates hash polarization, reducing load imbalance by ~90%.
- Each Pod contains 15K GPUs (dual-plane doubles capacity).
- Optimized path selection via precise ECMP hashing and disjoint-path-aware load balancing in the collective communication library (see the sketch below).
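A sketch of what "disjoint-path aware" load balancing can look like at the collective-library level (hypothetical interface, not NCCL's or Alibaba's actual API): instead of trusting the ECMP hash, spread a job's connections over explicitly distinct uplinks so elephant flows never collide.

```python
from itertools import cycle

def assign_paths(connections, uplinks):
    """Round-robin connections onto uplinks; paths stay disjoint while uplinks last."""
    rr = cycle(uplinks)
    return {conn: next(rr) for conn in connections}

conns = ["qp-%d" % i for i in range(8)]  # e.g. one RDMA QP per NIC
print(assign_paths(conns, uplinks=list(range(8))))
# Every connection lands on a different uplink: perfectly even load,
# regardless of how low the traffic's entropy is.
```

The dual-plane design keeps the candidate uplink set small enough for this kind of explicit, per-connection placement to be practical.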
Tier3 (Cross-Pod-level)
- High oversubscription (15:1), optimized for Pipeline Parallelism (PP) traffic, which has lower bandwidth requirements.
- Dual-plane concept maintained at Core-level to prevent polarization issues for PP traffic across pods.
4 Frontend Network (Separated from Backend)
- Dedicated frontend NIC per host (2×200Gbps).
- Handles storage, management, and inference traffic separately, with full bisection bandwidth (1:1).
5 Operational Results
- In production for 8 months with no single-point-of-failure incidents at the ToR level.
- Observed performance gains:
- Overall training throughput: 14.9% improvement.
- Cross-segment traffic reduced by 37%.
- Collective communication improved by up to 59% (AllReduce) and up to 158% (Multi-AllReduce).
6 Reliability Evaluation
- Non-stacked dual-ToR ensures near-instant recovery from NIC-ToR link failures or flapping, compared to significant outages in single-ToR setups.
- Effective at avoiding catastrophic training interruptions.
7 Operational Experience & Lessons Learned
- One Pod per data-center building aligns with Alibaba's existing infrastructure (18MW buildings, ~15K GPUs each).
- Use of multimode optical transceivers reduces intra-building networking costs by 70%.
- A rail-only Tier2 was rejected as too inflexible for future model architectures (e.g., MoE requires cross-rail communication).
- HPN's complex wiring is verified using INT-based (In-band Network Telemetry) probes.
8 Future-Proofing & Scalability
- A next-generation design for upcoming 102.4Tbps switch chips is already being planned.
- Scalability to 100K GPUs is supported through careful Core-level and cross-Pod optimizations.
Comments
The paper proposes dual-ToR to remove the single point of failure at the ToR, with the non-stacked design further improving reliability. The dual-plane architecture mitigates hash polarization, addressing the low-entropy challenge. Bursty traffic is absorbed through more even load balancing, aided by the rail-optimized network (isolating traffic of GPUs on different rails), dual-plane aggregation (isolating the traffic of a NIC's two ports), and precise path selection.