[Paper Notes] RDMA over Ethernet for Distributed AI Training at Meta Scale
SIGCOMM’24, by Meta
TL;DR: Meta deployed RDMA over Converged Ethernet (RoCE) networks to support massive distributed AI training workloads at scale. They designed dedicated backend networks, evolved their routing schemes, pivoted away from traditional congestion control (DCQCN) in favor of collective-library-based congestion management, and built extensive operational tooling, sustaining high and predictable collective performance across clusters of thousands of GPUs.
1 Problem / Motivation
- AI training workloads require high bandwidth, low latency, and predictable network performance for tightly synchronized GPU operations.
- Traditional TCP/IP stacks incur high CPU overhead and latency; proprietary interconnects (InfiniBand, NVSwitch) limit vendor diversity and operational flexibility.
- RoCE provides an open-standard, CPU-bypass method for low-latency GPU-to-GPU communication at scale.
2 Key Ideas & Design Decisions
Network Design Principles
- Dedicated Backend (BE) network for GPU communication; separate Frontend (FE) network for storage, logging, etc.
- RoCE leverages existing Ethernet infrastructure and tooling, ensuring operational simplicity and vendor diversity.
RoCE Selection Motivation
- Open standards with RDMA verbs semantics familiar to training applications.
- Seamless integration into existing Clos-based data center networks.
3 Architecture Overview
Training Nodes
- Grand Teton (H100 GPUs, 8 GPUs per node interconnected via NVLink/NVSwitch, each GPU linked 1:1 with RDMA NIC).
- GPUDirect technology for GPU-to-GPU communication bypassing host CPU and memory.
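A toy sketch of the node layout (illustrative names only, not Meta's actual configuration schema): eight GPUs per node, each paired 1:1 with its own backend RDMA NIC.

```python
# Illustrative sketch (not Meta's config schema): a Grand Teton-style node with
# 8 GPUs, each paired 1:1 with a dedicated RoCE NIC on the backend network.
from dataclasses import dataclass

@dataclass
class GpuNicPair:
    gpu_index: int   # GPU 0..7, interconnected intra-node via NVLink/NVSwitch
    nic_name: str    # dedicated RDMA NIC for this GPU (GPUDirect RDMA path)

def grand_teton_node(hostname: str) -> list[GpuNicPair]:
    # Intra-node traffic stays on NVSwitch; inter-node traffic goes
    # GPU -> paired NIC -> RTSW without touching host CPU/memory.
    return [GpuNicPair(gpu_index=i, nic_name=f"{hostname}-rdma{i}") for i in range(8)]
```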
Network Topology
Two-stage Clos (leaf-spine) architecture:
- RTSW (Rack Training Switch): Leaf tier, connects GPUs within a rack.
- CTSW (Cluster Training Switch): Spine tier, deep-buffered switches to absorb congestion bursts.
- ATSW (Aggregator Training Switch): Aggregation tier connecting multiple Clos clusters (AI Zones) for DC-scale.
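A toy sketch of this hierarchy (all counts are made up, not Meta's actual scale): RTSW and CTSW form a full-mesh Clos inside each AI Zone, and CTSW uplinks to ATSW planes stitch zones together.

```python
# Hedged sketch of the BE fabric hierarchy; switch counts are placeholders.
from collections import defaultdict

def build_fabric(n_zones: int = 2, n_rtsw: int = 4, n_ctsw: int = 4, n_atsw: int = 2):
    """Return an adjacency map: RTSW<->CTSW full mesh per zone, CTSW<->ATSW across zones."""
    links = defaultdict(set)

    def connect(a: str, b: str) -> None:
        links[a].add(b)
        links[b].add(a)

    for z in range(n_zones):
        for r in range(n_rtsw):
            for c in range(n_ctsw):
                connect(f"z{z}-rtsw{r}", f"z{z}-ctsw{c}")   # leaf <-> spine (AI Zone)
        for c in range(n_ctsw):
            for a in range(n_atsw):
                connect(f"z{z}-ctsw{c}", f"atsw{a}")        # spine <-> aggregation (DC-scale)
    return links
```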
4 Routing Evolution & Techniques
Challenges
- Training traffic has low flow entropy (few, very large "elephant" flows per GPU pair) and is bursty, which makes load balancing across equal-cost paths hard.
Routing Strategies Evaluated
- ECMP: Initial trials showed poor load balancing due to low entropy (hash collisions).
- Path Pinning: statically routes flows over fixed paths keyed on the destination slice; effective only under ideal, fully allocated conditions, and degraded with fragmented or partially allocated racks.
- Enhanced ECMP (E-ECMP) + QP Scaling (see the hashing sketch after this list):
- Increased flow entropy by additionally hashing on the destination Queue Pair (QP) field of RoCE packets in the switch ASIC, combined with scaling each connection across multiple QPs.
- Improved collective performance by up to 40% over baseline ECMP, but required per-workload tuning of the QP-scaling scheme.
- Centralized Traffic Engineering (TE):
- A central controller computes CSPF-based paths from a live view of topology and traffic, reacting to failures and maintenance in real time.
- Excellent load balancing, but sensitive to multiple concurrent link failures and adds control-plane complexity.
- Flowlet Switching (future direction):
- Hardware-assisted re-routing of flows at flowlet boundaries to improve link utilization while limiting out-of-order packets (see the flowlet sketch at the end of this section).
- Caveat: the flowlet interval is a trade-off; too small causes packet reordering, too large reduces responsiveness to congestion.
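A minimal sketch of why QP scaling helps E-ECMP (the numbers and hash are illustrative; real switches hash header fields in the ASIC, not with Python's hash): with only the UDP 5-tuple, each GPU pair is a single flow and a handful of elephant flows can collide on the same uplink; folding the RoCE destination QP into the hash and spreading each pair over several QPs restores entropy.

```python
# Hedged sketch, not the switch implementation: compare uplink load with and
# without the destination QP folded into the ECMP hash.
from collections import Counter

UPLINKS = 16

def pick_uplink(src: str, dst: str, qp: int, hash_qp: bool) -> int:
    key = (src, dst, 4791, qp if hash_qp else 0)   # 4791 = RoCEv2 UDP port
    return hash(key) % UPLINKS

def simulate(n_pairs: int = 32, qps_per_pair: int = 8, hash_qp: bool = True) -> Counter:
    load = Counter()
    for p in range(n_pairs):
        for qp in range(qps_per_pair):
            load[pick_uplink(f"gpu{p}", f"gpu{(p + 1) % n_pairs}", qp, hash_qp)] += 1
    return load

print("5-tuple only :", sorted(simulate(hash_qp=False).values(), reverse=True))
print("with QP hash :", sorted(simulate(hash_qp=True).values(), reverse=True))
```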
Operational Trade-off
- Final strategy: TE for smaller AI clusters (ranking workloads) and E-ECMP for DC-scale clusters (LLMs).
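A hedged sketch of the flowlet idea referenced above (the interval value and path-selection policy are assumptions, not the paper's): a flow may only be moved to a new uplink after an idle gap longer than the flowlet interval, which bounds receiver-side reordering.

```python
# Toy flowlet router; real ASICs pick the least-loaded uplink rather than random.
import random

FLOWLET_INTERVAL_US = 100.0   # small -> more reordering; large -> less responsive
UPLINKS = 16

class FlowletRouter:
    def __init__(self) -> None:
        self.last_seen_us: dict[tuple, float] = {}
        self.path: dict[tuple, int] = {}

    def route(self, flow: tuple, now_us: float) -> int:
        gap = now_us - self.last_seen_us.get(flow, -1e18)
        if flow not in self.path or gap > FLOWLET_INTERVAL_US:
            # idle gap exceeded: earlier packets have likely drained, safe to re-balance
            self.path[flow] = random.randrange(UPLINKS)
        self.last_seen_us[flow] = now_us
        return self.path[flow]
```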
5 Transport & Congestion Control
Initial Attempt (DCQCN)
- Tuning proved difficult: ECN marking thresholds had to be co-tuned with buffer settings, and mis-tuned thresholds directly degraded performance.
- The limited performance benefit plus the added tuning complexity led to abandoning DCQCN in favor of the collective-library approach below.
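For context on the tuning difficulty, a simplified sketch of the standard DCQCN sender reaction (the textbook DCQCN algorithm, not a Meta-specific variant): each congestion notification triggers a multiplicative rate cut whose depth tracks how often the switch's ECN thresholds fire, so marking thresholds, buffer settings, and throughput end up tightly coupled.

```python
# Simplified DCQCN sender-side reaction; G is a typical default gain, not a tuned value.
G = 1 / 256

def on_cnp(current_rate: float, alpha: float) -> tuple[float, float]:
    """Congestion Notification Packet received: raise alpha, cut the rate."""
    alpha = (1 - G) * alpha + G
    current_rate = current_rate * (1 - alpha / 2)
    return current_rate, alpha

def on_quiet_period(alpha: float) -> float:
    """No CNPs for a timer period: decay alpha so future cuts are gentler."""
    return (1 - G) * alpha
```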
Adopted Solution
- Receiver-driven admission control at the collective library layer:
- GPU receivers explicitly control when senders transmit via clear-to-send (CTS) grants.
- Reduced incast congestion significantly; no observed head-of-line blocking or prolonged congestion.
- NIC QoS prioritization for CTS/ACK packets to minimize congestion impact.
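A toy model of receiver-driven admission control (a simplification for illustration, not the actual collective-library implementation): the receiver hands out a bounded number of CTS grants, so only a fixed window of chunks can be in flight toward one GPU at a time, which caps incast.

```python
# Hedged sketch: a shared grant budget bounds in-flight chunks from all senders.
import asyncio

class Receiver:
    def __init__(self, window: int = 4) -> None:
        self.grants = asyncio.Semaphore(window)   # outstanding CTS budget

    async def request_to_send(self) -> None:
        await self.grants.acquire()               # sender waits for a CTS grant

    def chunk_received(self) -> None:
        self.grants.release()                     # delivered chunk frees a grant

async def sender(rx: Receiver, chunks: int) -> None:
    for _ in range(chunks):
        await rx.request_to_send()                # blocked until receiver grants CTS
        await asyncio.sleep(0)                    # "transmit" the chunk
        rx.chunk_received()                       # chunk delivered; grant returned

async def main() -> None:
    rx = Receiver(window=4)
    await asyncio.gather(*(sender(rx, 8) for _ in range(16)))  # 16-way incast

asyncio.run(main())
```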
Network Buffer Strategy
- Deep-buffered spine switches (CTSW) absorb temporary congestion bursts effectively.
6 Operational Experience & Co-tuning Insights
Co-tuning Network and Collective Libraries
- Substantial performance improvements through precise tuning of:
- NCCL parameters (message size, logical topology selection, channel buffering strategy).
- QoS prioritization, GPU-to-NIC PCIe settings, and credit adjustments.
- Achieved over 2x performance improvement in some scenarios by systematically addressing both network and software stack bottlenecks.
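For flavor, a hypothetical tuning snippet: the variable names below are real NCCL environment variables, but the values are placeholders, not the settings reported in the paper.

```python
# Illustrative NCCL knobs only; values are assumptions for the sketch.
import os

os.environ.update({
    "NCCL_ALGO": "Ring",                    # logical topology selection (Ring vs Tree)
    "NCCL_BUFFSIZE": str(8 * 1024 * 1024),  # per-channel buffer / message chunking
    "NCCL_MAX_NCHANNELS": "16",             # parallelism of the collective
    "NCCL_IB_QPS_PER_CONNECTION": "4",      # QP scaling, feeds E-ECMP entropy
    "NCCL_IB_TC": "106",                    # traffic class -> QoS queue for RDMA data
})
# NIC QoS for CTS/ACK packets and GPU-to-NIC PCIe settings are tuned outside
# NCCL (NIC firmware / switch config), so they are not shown here.
```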
Routing Impact Over Time (Measured Data)
- Initial stages (static routing on a non-oversubscribed fabric): poor and inconsistent performance.
- Traffic engineering allowed mild oversubscription (1:1.125) with no performance loss, significantly reducing infrastructure requirements.
7 Observability & Troubleshooting
Tooling Developed
- Real-time telemetry: RDMA NIC counters (out-of-sequence packets, ACK timeouts).
- PFC watchdog, buffer threshold monitoring, reachability monitoring.
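A minimal polling sketch in the spirit of this tooling (not Meta's telemetry stack): scrape per-port RDMA hardware counters from sysfs; counter file names vary by NIC vendor and driver, and the ones shown are typical of mlx5-class devices.

```python
# Hedged sketch: poll out-of-sequence and ACK-timeout counters for one RDMA port.
from pathlib import Path

WATCHED = ("out_of_sequence", "local_ack_timeout_err")

def read_rdma_counters(dev: str = "mlx5_0", port: int = 1) -> dict[str, int]:
    base = Path(f"/sys/class/infiniband/{dev}/ports/{port}/hw_counters")
    counters = {}
    for name in WATCHED:
        f = base / name
        if f.exists():
            counters[name] = int(f.read_text().strip())
    return counters

# A collector would diff successive samples and alert on sustained growth,
# which usually points to reordering (routing churn) or downstream drops.
```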
Troubleshooting Examples
- Software image regression affecting internal switch latency.
- Complex interplay of firmware updates, bursty traffic, and new hardware (H100 GPUs) resulting in intermittent packet drops.
8 Practical Lessons Learned & Future Work
- Traditional congestion control mechanisms (DCQCN) ill-suited for AI training collectives.
- Receiver-driven flow control via collective libraries is highly effective.
- Operational complexity vs. performance optimization is a delicate trade-off.
- Continuous investment in telemetry, monitoring, and automated recovery is critical.