Qianliang's blog

[Paper Notes] RDMA over Ethernet for Distributed AI Training at Meta Scale

SIGCOMM’24, by Meta

TL;DR: Meta deployed RDMA over Converged Ethernet (RoCE) networks to support massive distributed AI training workloads. They designed dedicated backend networks, implemented advanced routing techniques, pivoted away from traditional congestion control (DCQCN) in favor of collective-library-based congestion management, and developed extensive operational tooling, achieving high GPU training performance across clusters of thousands of GPUs.

1 Problem / Motivation

2 Key Ideas & Design Decisions

Network Design Principles

RoCE Selection Motivation

3 Architecture Overview

Training Nodes

Network Topology

Two-stage Clos (leaf-spine) architecture
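As a refresher on what a two-stage Clos fabric implies, here is a minimal sketch of a generic leaf-spine topology. The sizes, names (`leaf0`, `spine0`), and helper functions are illustrative assumptions for exposition, not Meta's actual fabric; the point is that every leaf connects to every spine, so the ECMP fan-out between any two leaves equals the number of spines.

```python
# Generic two-stage Clos (leaf-spine) sketch -- illustrative only,
# not Meta's actual topology or scale.
from itertools import product

def build_leaf_spine(num_leaves: int, num_spines: int):
    """Link set of a fabric where every leaf connects to every spine."""
    return {(f"leaf{l}", f"spine{s}")
            for l, s in product(range(num_leaves), range(num_spines))}

def equal_cost_paths(links, src_leaf: str, dst_leaf: str):
    """All two-hop paths src_leaf -> spine -> dst_leaf (the ECMP choices)."""
    spines_up = {s for (l, s) in links if l == src_leaf}
    spines_down = {s for (l, s) in links if l == dst_leaf}
    return sorted((src_leaf, s, dst_leaf) for s in spines_up & spines_down)

links = build_leaf_spine(num_leaves=4, num_spines=2)
paths = equal_cost_paths(links, "leaf0", "leaf3")
print(len(paths))  # → 2: one equal-cost path per spine
```

This full-mesh property is what makes routing (Section 4) interesting: with many equal-cost paths, the fabric's performance hinges on how flows are spread across them.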

4 Routing Evolution & Techniques

Challenges

Routing Strategies Evaluated

Operational Trade-off

5 Transport & Congestion Control

Initial Attempt (DCQCN)

Adopted Solution

Network Buffer Strategy

6 Operational Experience & Co-tuning Insights

Co-tuning Network and Collective Libraries

Routing Impact Over Time (Measured Data)

7 Observability & Troubleshooting

Tooling Developed

Troubleshooting Examples

8 Practical Lessons Learned & Future Work