A RoCE network for distributed AI training at scale

jauntywundrkind · on Aug 5, 2024

From the paper, seems like they are using RDMA to/from video cards, skipping the nic.

> * These transactions require GPU-to-RDMA NIC support for optimal performance*

Remarkably consumer computing actually has similarly found reason to bypass sending data through the cpu; texture streaming. DirectStorage and Sony's Kraken purport to let the GPU read direct from the SSD. It's a storage application instead of NIC, but still built around PCIe DMA-P2P (at least the DirectStorage is I think).

Table 2, network stats for 128 GPUs is kind of interesting. Most topologies such as AllGather and AllReduce run with only 4 Queue Pairs. Not my area of expertise at all but wow that seems tiny! All this network, and basically everyone's talking to only a few peers? That's what it means right?

The discussion at the end of the paper talked about Flowlets. The description makes me think a little bit of hash bucket chaining, where you try the first path, and if latter a conflict arise or the oath degrades, there's a fallback path already planned. Like there's would be a fallback chained bucket in a hash.

wmf · on Aug 5, 2024

The NIC is still there but they're skipping the data copy from system RAM to GPU RAM. https://developer.nvidia.com/gpudirect

latchkey · on Aug 6, 2024

Both Intel (Gaudi 3) and Tenstorrent (Wormhole) have built NIC's into the GPUs. I think that we are going to see more of that in the future, but the only problem with this is that your failure cases become more extreme. NIC dies, I swap a NIC... GPUNIC dies... I swap both.

128 NICs/GPUs is the largest single non-blocking 400G switch size today using a broadcom t5 chip [0]. If you want to go larger than that, you have to go with 6x switches in a spine/leaf configuration, which greatly increases the costs.

We're (Hot Aisle) in the process of deploying a cluster of 128 MI300x right now with this configuration using a Dell Z9864F switch [1]. https://hotaisle.xyz/networking/ and https://hotaisle.xyz/compute/

[0] https://www.broadcom.com/products/ethernet-connectivity/swit...

[1] https://www.delltechnologies.com/asset/en-us/products/networ...

mrlongroots · on Aug 6, 2024

> still built around PCIe DMA-P2P

Right! GPUs are bandwidth behemoths. If you have a 400 Gbit/s network and a GPU that can saturate it, there's a very good bandwidth reason to cut the CPU out of the path. For collectives, you are probably optimizing for latency vs bandwidth, but in all cases the less you move the data the better it is.

> Most topologies such as AllGather and AllReduce run with only 4 Queue Pairs. Not my area of expertise at all but wow that seems tiny! All this network, and basically everyone's talking to only a few peers? That's what it means right?

1. It seems that the cardinality of the collective groups is quite small (~128 GPUs), despite the reported cluster being 16k GPUs. They attribute it to "multidimensional parallelism" and while I don't speak much ML, I'm guessing that it's some sort of horizontal + vertical partitioning, and collectives are only triggered within a partition.

2. As these are collectives, all nodes do participate, but you create an overlay network with a ring or tree topology per collective. In a ring, every node just needs a QP each for left and right neighbors. For trees, it depends on the fanout. Now you could have shallow trees with larger fanouts, as latency is a function of the depth, and I feel like you can have more than 4 QPs (~500 is where NIC SRAM becomes an issue AFAIK). I think there is some buffering overhead per QP (beyond the standard NIC state) that they are trying to minimize, haven't read the paper thoroughly.

3. You do need a fancy network for collectives. They are a latency-sensitive operation on the critical path. The entire multimillion dollar thing is idle while an AllReduce is active, so 500us vs 100us in E2E latency matters a lot.

eslaught · on Aug 5, 2024

So they're re-inventing HPC networks in the data center.

https://en.wikipedia.org/wiki/Fat_tree

https://www.cs.umd.edu/class/spring2021/cmsc714/readings/Kim...

I'm sure there are innovations here, but most of this has been standard in HPC for decades. (Fat trees since 1985, Dragonfly since 2008.) This is not new science, folks.

wmf · on Aug 5, 2024

It's not new science, but tuning RoCE performance is new engineering.

mrlongroots · on Aug 6, 2024

And honestly the difference between science and engineering in systems is overrated. The real difference is "systems that work, or everyone knows how to run" vs "systems that are a bit of black magic".

While on paper it is true that HPC has had collectives and dragonflies and RDMA for ages, a whole host of issues make tuning these things hard and tedious. Such work also informs the design of future "smart networks" by keeping everyone updated with common problems and workarounds that folks employ.

teleforce · on Aug 6, 2024

Interesting approach on distributed AI training albeit a very expensive one. Personally I'm baffled why no one has come up with a similar project to SETI@home or Great Internet Mersenne Prime Search in harnessing truly distributed and low cost solutions to open model of AI training at scale [1],[2].

[1] SETI@home:

https://setiathome.berkeley.edu/

[2] Great Internet Mersenne Prime Search:

https://en.wikipedia.org/wiki/Great_Internet_Mersenne_Prime_...

dan-robertson · on Aug 7, 2024

I think there’s two reasons:

1. AI training for large models just requires a lot of bytes – there are lots of parameters and lots of training data, and I think you don’t provide such useful training by just reusing the same subset of the training data. Similarly you can’t really train just a fraction of the parameters. The more diverged participants are, the less useful their contributions may be so I think you’d need to move the parameters back and forth quite frequently too

2. People just don’t have that much compute at home. And you probably need more compute because of the bandwidth/latency issues.

teleforce · on Aug 8, 2024

Both your points are relevant but neither are critical nor crucial for successful training of GPT foundation models.

For the first point, unlike real-time queries of ChatGPT prompts, training does not relied on minimum latency for feedback.

For the second point, our average laptops and PCs nowadays are much better than the highest end graphics workstation in gaming industry AAA studios about 30 years ago circa 1990s.

Last year Google made an interesting memo mentioning that there's no company has the moat on GPT and eventually nobody can compete with open source AI model. Perhaps Google is referring and alluding to the potential of distributed global collaborations for training of foundation models not unlike the SETI@home and similar approaches.

dan-robertson · on Aug 13, 2024

I’m pretty confused because your response seems very obviously confused.

Firstly, the latency problem in training is indeed not the same as that of evaluation. The cycle looks like 1. Input model weights, 2. Choose some subsets of training data, 3. Update weights based on model evaluated against, 4. Back to step 1. When you do distributed training, step 4 requires collecting updated weights from each worker and then merging them together. You want some limits on the amount of computation between each merge because the merge is less useful when the parts have diverged further (imagine some super simple example where a model has two neurones in a layer and they tend to become sensitive to situations A and B. If you distribute your training and some come back with neurone one becoming strongly sensitive to case A and others with it strongly sensitive to case B, and you merge them together, you likely get a fuzzy thing where the first is not very sensitive to either case yet, potentially throwing away a lot of the information that was gained in training.

To put some numbers on that, let’s say you have some 1B parameter model fast home network at 300 Mbps with 2ms latency to whatever is doing the coordination (ie probably in the same city). Maybe with some clever compression or dropout layers you cause an update to be 1GB. That’s then a little over a minute to send your weights out for coordination and get back 1GB of fresh weights (probably this would be larger than 1GB and this ignores any synchronisation with other nodes). If you compare this to the networking fabric you might expect in a GPU cluster, that’s something like 200Gbps and a few us of latency so even sending 4GB of weights back and forth (makes more sense as compression costs compute time) is more like 200ms plus latency. Note that the difference when trying to be favourable to your home setup is >300x.

Secondly, let’s look at compute. A ‘smallish’ cluster of GPUs might look like 8 H100s in some very specialised box, plus a bunch of extra supporting hardware. That’ll give you over 7e15 half-precision flops. In your train@home example, someone with a top-of-the-line prosumer GPU might get something like 1e14 half-precision flops, though I think you’d probably be pretty lucky to get 1/5 of that. So it takes 70 (more like 350) of these machines to match the flops of the smallish cluster, and that ignores differences in memory bandwidth and in the coordination of moving more data around (because each node is smaller) over slower links.

To me, it seems the main things that make it unviable are the coordination and the network requirements, but maybe something sufficiently clever could someday make this kind of distribution more viable.

If you compare this whole thing to running Monte Carlo algorithms like folding@home, hopefully you can see the difference in that the size of the problem description is a lot smaller and synchronisation isn’t really required at all. And didn’t folding@home largely fail anyway?

immibis · on Aug 6, 2024

Unlike those projects, AI training requires very high bandwidth between nodes.

sooperserieous · on Aug 6, 2024

They have: http://salad.com and others

zuckerma · on Aug 6, 2024

This is slick