Hey, everybody! This is Pritam Damania, and I'm a software engineer at Facebook working on PyTorch Distributed. I'm going to talk a little bit more about PyTorch distributed today. In terms of agenda, I'll cover distributed data parallel, which is DDP, and c10d, which is the distributed communication library, and then I'll talk a little bit about future work, in terms of what's coming next for PyTorch distributed.

Let's take a quick refresher of distributed data parallel first. If you have a single model that's small enough to fit on a single GPU, you'd use distributed data parallel to train it at large scale, in terms of a large amount of data and a large number of GPUs. You replicate the model on multiple GPUs, run the forward and backward passes in parallel, and once you have the gradients, there's a synchronized gradient operation that all of the ranks enter to aggregate the gradients. Then you continue with further iterations, where you run more forward and backward passes and synchronize gradients each time.

So that was a quick overview. Now let's talk about what's coming: some of the new improvements in DDP. The first one is the DDP communication hook. What this does is allow you to completely override the synchronized gradient operation that I just talked about. You register a Python callable with some arbitrary logic for how you want to aggregate the gradients. One example: if you actually want to do FP16 compression on gradients before you communicate them, you could have a callable that compresses the gradients by converting them to float16, all-reduces in float16, and then finally decompresses back to float32.
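A minimal sketch of such a hook, assuming the `GradBucket.buffer()` API from recent PyTorch releases (older versions exposed the bucket tensor under a different name); this is an illustration, not the exact code from the talk:

```python
import torch
import torch.distributed as dist

def fp16_compress_hook(state, bucket):
    """DDP communication hook: all-reduce the bucket's flattened
    gradients in float16, then decompress back to float32."""
    # Compress: cast to half precision, and pre-divide by world size
    # so the summed all-reduce produces an average.
    compressed = bucket.buffer().to(torch.float16).div_(dist.get_world_size())
    fut = dist.all_reduce(compressed, async_op=True).get_future()

    def decompress(fut):
        # Decompress: cast the reduced result back to float32.
        return fut.value()[0].to(torch.float32)

    return fut.then(decompress)

# Registration on a DDP-wrapped model (hypothetical `ddp_model`):
# ddp_model.register_comm_hook(state=None, hook=fp16_compress_hook)
```

The hook returns a future, so DDP can overlap the compressed all-reduce with the rest of the backward pass.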
So this is one example; you could do fancier things like GossipGraD, which is a non-fully-synchronous SGD algorithm.

The next item is support for uneven inputs in DDP. If you have an uneven number of batches across different ranks, what would typically happen is that some ranks that have finished their data would not enter the synchronized gradient call, while other ranks that are still processing data would enter this call, and as a result this leads to either a hang or some sort of timeout. This has been a long-standing issue in DDP that a lot of PyTorch users have complained about. Now we do have a fix for this: you can use the model.join() context manager that's shown in the example here. What this ensures is that once some ranks are finished with their data, they perform a bunch of dummy synchronization operations to match the other ranks that are still processing data, and this guarantees that all ranks complete their processing together. So this is a nice way to deal with uneven inputs in your training.

Then we have some memory optimizations for DDP. DDP today creates a bunch of buckets to batch parameters together for an all-reduce call, which is much more efficient. But what DDP does is create an entire copy of the gradients for these buckets. So if you have a one-gigabyte model, you'll have one gigabyte of parameters, one gigabyte of gradients, and DDP would take another gigabyte because it creates a copy of these gradients. To get around this, we have a new parameter in DDP called gradient_as_bucket_view.
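The uneven-inputs fix might be used like this; a sketch with a hypothetical per-rank workload (the batch counts and model are made up for illustration), assuming the process group is already initialized:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(rank, world_size):
    # Assumes dist.init_process_group() has already run for this rank.
    model = DDP(nn.Linear(10, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # Hypothetical uneven workload: higher ranks get more batches.
    batches = [torch.randn(4, 10) for _ in range(5 + 3 * rank)]
    # join() lets ranks that run out of data shadow the collective
    # calls of ranks still training, so no rank hangs at the all-reduce.
    with model.join():
        for batch in batches:
            optimizer.zero_grad()
            model(batch).sum().backward()
            optimizer.step()
    return model
```

Without the join() context, the rank with only 5 batches would stop participating in the gradient all-reduce while other ranks were still inside it, producing exactly the hang described above.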
What this does is make the .grad field of your parameters a view into the bucket, so that we keep only one copy of the gradients.

Then I'd like to talk about combining DDP and RPC. Shen, in his talk, described the RPC framework and how it can be used for distributed model parallelism, while DDP is used for distributed data parallelism. Now we can combine both of these frameworks to build more complicated training paradigms. As you can see in this example, we have a model wrapped in DDP here, so that model is replicated, and then we have some remote parameters on worker 1. Now, if you'd like to train this model, you set up your distributed optimizer. In your forward pass, you retrieve the RRef, which is typically retrieved via RPC, and then you feed that into DDP. You compute the loss, and then you run your backward pass and optimizer step. This runs the backward pass, aggregates the gradients across all of the replicas, updates the remote parameters' gradients, and runs the optimizer remotely as well. So, as you can see, you can combine both of these frameworks pretty seamlessly.

Okay, now I'd like to talk about dynamic bucketing in DDP. DDP splits the parameters into multiple buckets, as I just mentioned, and when it builds these buckets it assumes that the order of the backward pass is the reverse of model.parameters(). If this assumption doesn't hold, and in many models that's the case, the buckets are not built in the optimal order. As a result, maybe bucket two gets ready before bucket one, and then bucket one needs to wait before it can schedule its all-reduce, so this results in suboptimal performance.
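The gradient_as_bucket_view optimization mentioned above is just a constructor flag on DDP; a minimal sketch, assuming an already-initialized process group and a toy model:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_model():
    # Assumes the process group is already initialized.
    # With gradient_as_bucket_view=True, each parameter's .grad is a
    # view into DDP's flat communication bucket rather than a separate
    # tensor, avoiding the extra full copy of the gradients.
    return DDP(nn.Linear(10, 10), gradient_as_bucket_view=True)
```

For the one-gigabyte model in the example above, this removes the extra gigabyte DDP would otherwise allocate for its gradient copies.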
To get around this, DDP now records the order of the parameters in the first backward pass and then rebuilds the buckets in the optimal order, so that the all-reduces can be scheduled optimally. This showed about a three to seven percent speedup in models like BERT and RoBERTa.

Then, finally, a few miscellaneous improvements. We've added better error handling in NCCL via a couple of environment variables; you can look at the documentation for these for more details. We've had a distributed key-value store in c10d for a while; it's mostly used for rendezvous and coordination, and we've now formalized this API and added some good documentation around it for users. We've also added Windows support for c10d, and I'd like to thank Microsoft for this contribution.

So, now, what's coming soon in PyTorch distributed? This is probably the short term, maybe PyTorch 1.8 or PyTorch 1.9. We're adding point-to-point communication support in ProcessGroup and c10d; this is built upon NCCL's point-to-point send and receive support. We're adding native GPU support for the RPC framework, so you can send and receive GPU tensors over RPC seamlessly. We're going to add a remote module / remote device kind of API for distributed model parallelism, so you don't have to use the RPC framework directly: if you just want to place some module or part of a model on a different host or a different GPU, you can use a remote module to do that, and it'll be a nice high-level API. We're going to add pipeline parallelism to PyTorch; this is a very popular way of training models that don't fit on a single GPU, so it'll be very useful for training much larger models. And then, finally, we're going to add a c10d extension to support third-party collective communication libraries.
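The c10d key-value store mentioned above is exposed through classes like TCPStore; a small sketch of the server side, with a hypothetical local address, port, and key names:

```python
import torch.distributed as dist

# Server side of the store; other workers would pass is_master=False
# with the same (hypothetical) address and port to connect.
store = dist.TCPStore("127.0.0.1", 29512, 1, True)

store.set("stage", "warmup")            # values round-trip as bytes
assert store.get("stage") == b"warmup"

# add() is an atomic counter, handy for lightweight coordination
# such as counting how many ranks have reached a barrier point.
store.add("num_ready", 1)
```

This is the same mechanism DDP and RPC use under the hood for rendezvous: every rank connects to the store, publishes its address, and reads the addresses of its peers.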
I'd like to thank Intel for this contribution. Okay, so that was the short term. Now, if we think about the longer term, maybe a year from now or even longer, here's what we're thinking about for PyTorch distributed. We're thinking about adding a ZeRO-style training framework for really large models. This was a very interesting paradigm that was proposed by Microsoft, and we're trying to incorporate it into PyTorch distributed. Then we're planning to add intra-layer parallelism. This is a very interesting technique that was used by Megatron from NVIDIA to train large transformer models. Then we're also planning to add TorchScript support for the c10d APIs. Today, if you have a very complex model with some collective communication within the model, you can't really TorchScript that model, because this is not possible with the c10d APIs, so we're planning to add support for that. Then we have auto-tuning for DDP. DDP has many parameters that need to be tuned manually today, for example the bucket size for the buckets that I talked about, so we're planning to add some auto-tuning so that users don't have to tune DDP for their particular environments. Then we have an idea called hybrid parallelism, where we're planning to make sure that things like pipeline parallelism, model parallelism, data parallelism, and even intra-layer parallelism work together seamlessly, so users can mix and match and figure out the best training paradigm for them. And then the next step beyond that: once you have hybrid parallelism, can we automate this completely, so that the user just gives us a model and their training resources, and we figure out what combination of hybrid parallelism works best for this model, and the user doesn't have to worry about it? Finally, I would like to share this
page with the distributed overview for PyTorch. It's the place where you can find all the information about PyTorch distributed, so I'd definitely recommend that you check it out. That's all I had. Thank you very much, and thank you for watching!