[Paper Notes] Alibaba HPN: A Data Center Network for Large Language Model Training

SIGCOMM’24, by Alibaba

TL;DR: Alibaba Cloud designed HPN, a two-tier, dual-plane Ethernet network tailored to large-scale LLM training. HPN addresses the challenges posed by bursty, low-entropy traffic and by sensitivity to single-point failures. Key innovations include a non-stacked dual-ToR architecture, dual-plane aggregation to eliminate hash polarization, rail-optimized tier1 segments built on the latest-generation switches, and extensive operational improvements. Deployed in production, HPN improved LLM training throughput by ~15%.

1 Problem / Motivation

Traditional data-center networks struggle to support Large Language Model (LLM) training due to:

- Bursty, low-entropy traffic: training generates periodic bursts carried by only a few very large flows, so hash-based per-flow load balancing (ECMP) spreads them poorly (see the sketch below).
- Sensitivity to single-point failures: thousands of GPUs compute synchronously, so a single failure (e.g., a ToR switch) can stall an entire training job.
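
To make the low-entropy point concrete, here is a minimal sketch (not from the paper) of per-flow ECMP hashing when only a handful of elephant flows exist; the flow tuples and uplink count are invented for illustration.

```python
# Toy illustration (not from the paper): per-flow ECMP with only a few
# large flows, as in LLM training collectives. All values are made up.
import hashlib

NUM_UPLINKS = 8

def ecmp_hash(flow):
    # Stand-in for a switch's 5-tuple hash; md5 keeps the run reproducible.
    return int(hashlib.md5(repr(flow).encode()).hexdigest(), 16)

# Eight elephant flows (src, dst, sport, dport) -- low entropy: few flows,
# each carrying a large, roughly equal share of the traffic.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791) for i in range(8)]

loads = [0] * NUM_UPLINKS
for flow in flows:
    loads[ecmp_hash(flow) % NUM_UPLINKS] += 1

print("flows per uplink:", loads)
# Expect an uneven spread: some uplinks carry multiple elephant flows while
# others sit idle, because there are too few flows to average out collisions.
```

This imbalance is what HPN's rail-optimized topology, dual-plane aggregation, and precise path selection are designed to avoid.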

2 Key Ideas & Design Innovations

Architectural Overview

Dual-ToR Design (Non-stacked vs. Stacked)

Rail-Optimized Network

3 Architecture Deep Dive

Tier1 (Segment-level)

Tier2 (Pod-level)

Tier3 (Cross-Pod-level)

4 Frontend Network (Separated from Backend)

5 Operational Results

6 Reliability Evaluation

7 Operational Experience & Lessons Learned

8 Future-Proofing & Scalability

Comments

The paper proposes dual-ToR to remove the ToR switch as a single point of failure, with the non-stacked design further enhancing reliability. The dual-plane architecture mitigates the hash polarization problem, addressing the low-entropy challenge. Bursty traffic is handled through more even load balancing, enabled by the rail-optimized network (isolating traffic of GPUs on different rails), dual-plane aggregation (isolating the traffic of the two ports on one NIC), and precise path selection.
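
On the hash-polarization point, a toy sketch (not the paper's algorithm) of what goes wrong when two switch tiers apply the same hash to the same flow fields; the flow tuples and port counts are invented for illustration.

```python
# Toy illustration of hash polarization: tier 1 and tier 2 switches reuse
# the same hash over the same flow fields. Everything below is made up.
import hashlib

def switch_hash(flow, num_ports):
    # The same hash function at both tiers -- the polarization precondition.
    return int(hashlib.md5(repr(flow).encode()).hexdigest(), 16) % num_ports

TIER1_PORTS = 4   # uplinks of a tier 1 switch
TIER2_PORTS = 4   # uplinks of a tier 2 (aggregation) switch

flows = [(f"10.0.{i}.{j}", f"10.1.{i}.{j}", 4791, 4791)
         for i in range(16) for j in range(16)]

# Tier 1 spreads flows across its uplinks.
tier1_buckets = {p: [] for p in range(TIER1_PORTS)}
for flow in flows:
    tier1_buckets[switch_hash(flow, TIER1_PORTS)].append(flow)

# Tier 2 re-hashes whatever arrives on each tier 1 uplink.
for port, bucket in tier1_buckets.items():
    tier2_loads = [0] * TIER2_PORTS
    for flow in bucket:
        tier2_loads[switch_hash(flow, TIER2_PORTS)] += 1
    print(f"tier1 uplink {port}: tier2 loads = {tier2_loads}")
# Every flow reaching a given tier 1 uplink already hashes to the same value,
# so tier 2 sends all of them out one port and leaves the rest idle.
```

As the note above says, the dual-plane design keeps each NIC port's traffic within its own plane, shrinking the set of equal-cost paths a hash has to spread over and avoiding this repeated-hash scenario.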