Transcript:
Hi, I’m Eric, here to tell you about RLlib: scalable RL for TensorFlow, PyTorch, and beyond. A bit about me: I’m currently a software engineer at Anyscale and a finishing PhD student at UC Berkeley. I’m the team lead for Ray Core and RLlib at Anyscale, and in my research I work on applied RL and ML systems. Before grad school, I spent a number of years in industry at Databricks and at Google.

I’m going to start this talk by telling you a bit about reinforcement learning and the problem RLlib is solving. The talk will also cover the current project status and upcoming developments in RLlib. It’s not going to be a really in-depth talk about RLlib; for that, we have an advanced RLlib talk that Sven Mika is also giving at the summit.

So why RL? Just as background, the main difference between RL and supervised learning is that reinforcement learning has the potential to more directly optimize for end objectives. In supervised learning you make predictions about data; for example, given some images, you might predict the category of the items in the images. In RL, you’re instead training an agent, or a policy, to take actions in some environment. Based on the actions the agent takes, the environment provides feedback to the agent in terms of observations and rewards, and over time the agent learns to improve the actions it takes in the environment so as to maximize the reward it receives (a short code sketch of this loop appears at the end of this introduction).

Reinforcement learning is a very old field, but it’s only recently, when combined with deep learning, that it started working for real in many applications. You’re probably aware of AlphaZero, which achieved superhuman performance in the game of Go, but RL has found success in many other domains, such as e-trading, ads optimization, database query optimization, systems control, and circuit layout, among others. Like supervised learning, though, reinforcement learning scales with compute: many of the recent successes in reinforcement learning, such as AlphaZero, depend not only on algorithmic innovations but also on leveraging specialized hardware and distributed compute clusters.

What this really means is that the software for reinforcement learning is also quite important, and this is the problem RLlib is trying to solve: providing a unified reinforcement learning library that can easily scale to large clusters. What we’ve found is that different users of RL care about different features, because they focus on different aspects of reinforcement learning. For example, teams of research engineers care about both building RL systems and the applications; academic researchers care primarily about the algorithms; and applied scientists and product engineers are trying to leverage existing systems and algorithms to build their applications.

To draw an analogy with supervised learning: in that field there are many deep learning frameworks, for example PyTorch and TensorFlow, that provide common ways to express and scale tensor computations, and basically all kinds of users use these common frameworks. RLlib serves the same role for reinforcement learning: it provides common APIs for expressing and scaling reinforcement learning training, no matter what your use case is.
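To make the agent-environment loop described above concrete, here is a minimal sketch (not from the talk) of a random "policy" interacting with an OpenAI Gym environment. It assumes the classic Gym API (pre-0.26), where reset() returns an observation and step() returns four values.

```python
import gym

# Minimal agent-environment loop: observe, act, receive a reward, repeat.
env = gym.make("CartPole-v0")
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    # A trained policy would map obs -> action; here we sample randomly
    # just to illustrate the interaction loop.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
```

An RL algorithm’s job is to replace that random action choice with a learned policy that maximizes the total reward over time.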
So what is RLlib? It’s an open-source library, and from the user’s perspective it has three main layers.

The first layer is a unified API that makes reinforcement learning accessible from a wide variety of applications. This of course includes benchmark environments such as OpenAI Gym, but RLlib also supports multi-agent scenarios, serving policies to external systems, and learning on offline or batch RL data. Second, it has a collection of best-in-class reference algorithms, spanning model-free, model-based, and other algorithms. Finally, it has primitives for implementing new reinforcement learning algorithms, which you might care about if you’re a researcher or an RL engineer.

I’m going to first talk about RLlib’s unified API. As an example of the benefits of a unified API, let’s look at an application called Neural MMO. This is a massively multi-agent game environment recently released by OpenAI for research purposes. Despite being a simulated game, it is actually an extremely challenging application in RL terms. You’re not training just one agent to act; you’re training a group of agents to compete or cooperate with each other. There is a dynamic number of agents in the environment, since agents can live and die, and agents receive complex structured observations. It’s not just a single vector of features or a single image; it’s like a real-world system where you have metrics, telemetry, and so on about nearby entities. Despite this complexity, Neural MMO can basically train out of the box on RLlib, and this is possible because RLlib’s APIs are general enough to cover this application. This is something no other RL library can do, because they don’t have a unified API.

So why is having a unified API important? Beyond the obvious software engineering reasons, I think there are a couple of key points. The first is easy scalability: any application that runs on RLlib can automatically scale with the Ray distributed system. In fact, thanks to RLlib’s scalability, the Neural MMO authors found that right after they integrated with RLlib, they were able to get new state-of-the-art performance just through scale. A second reason is that a unified API allows easy experimentation for applied use cases, even if you don’t need to scale. For an applied problem, you’re experimenting with many different ways to express the problem in reinforcement learning terms, so you want a lot of flexibility to tinker with different approaches, for example multi-agent decompositions, different model types, and so on, which RLlib allows within one library, without needing to switch between different software frameworks.

The second thing that RLlib provides is a collection of reference algorithms; there’s a list of them in the RLlib documentation at rllib.io. RLlib provides a cohesive API across more than 14 TensorFlow algorithms and 18 PyTorch algorithms. It lets you easily scale all these algorithms from a laptop to a cluster, and also customize them for complex use cases, for example multi-agent RL, as needed.
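As a rough illustration of what "laptop to cluster" looks like in practice, here is a minimal sketch of launching one of the reference algorithms (PPO) through Ray Tune, using Ray 1.0-era config keys; the exact keys and values here are illustrative and may differ across versions.

```python
import ray
from ray import tune

ray.init()  # or ray.init(address="auto") to connect to an existing cluster

# Launch the PPO reference algorithm on a Gym benchmark environment.
# Scaling out is just a config change: num_workers controls how many
# parallel rollout workers Ray schedules, locally or across machines.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",
        "num_workers": 4,       # parallel experience collection
        "framework": "torch",   # or "tf"; same algorithm either way
    },
    stop={"episode_reward_mean": 150},
)
```

The same script runs unchanged on a laptop or a cluster; only the Ray cluster you connect to and the worker count change.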
To give some credit here, these algorithms come from a variety of community contributors: obviously RISELab and Anyscale, but also a number of other companies and university groups. We’re very grateful for these contributions, because a single organization would have a hard time having enough expertise to maintain all these different types of algorithms.

Finally, if you’re an algorithms researcher or need to deeply customize an algorithm, RLlib provides primitives for building completely novel RL algorithms that seamlessly fit in with RLlib’s unified API. As you can imagine, primitives for building RL algorithms are a complex topic, so here I’m going to dive specifically into how RLlib scales algorithms to a cluster, and the example I’m going to use is scaling the basic policy gradients algorithm.

If you’re not familiar with policy gradients, not to worry: math aside, the computation pattern is actually really simple. In fact, it’s just what’s shown on this slide. The steps are basically as follows, starting from the left. First is parallel rollouts: with the current policy, we generate rollouts from the environment in parallel, so basically we gather experiences given the current policy. The next step is to combine these experiences into a single dataset with a concatenate operator. We then take this concatenated data and use it to update our policy with stochastic gradient descent. Finally, we broadcast the new policy to our workers, report metrics, and repeat.

To get this to work in RLlib, there are a couple of steps. First, you need to express this series of steps in RLlib, and RLlib has a domain-specific language that lets you do this, and that’s pretty much it. Once you do that, RLlib will automatically schedule and execute the algorithm with Ray.

To make this more concrete, this is what the DSL actually looks like; it’s copy-pasted from the policy gradients implementation, so I’ll walk through it. What is the execution plan for this policy gradients algorithm? It’s a distributed plan that runs across many workers, potentially on many machines, and it looks like this. First, we have a set of workers, and we tell these workers to do parallel rollouts, so they roll out in parallel to gather experiences. Next, we take those rollouts and combine them into batches of some minimum batch size specified by the configuration; this is the concat-batches operator. Then we apply the train-one-step operator to do one step of stochastic gradient descent on the policy, given these experiences. This series of steps repeats over and over until the policy is trained to some target reward. And, of course, we return a standard metrics-reporting wrapper around this plan that reports metrics in a standard way across all RL algorithms (a rough code sketch of the plan follows at the end of this passage).

This distributed execution DSL is a new feature of RLlib 1.0, and it makes it much, much easier to write new distributed algorithms. We’ve already ported all the internal algorithms to this new paradigm, and it’s a huge simplification. For example, Ape-X and IMPALA, two of the more complex high-performance distributed algorithms in RLlib, have gone from four to five hundred lines of code to just one or two hundred lines of code.
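The slide with the copy-pasted DSL isn’t captured in this transcript, but based on the operators just described, the policy gradients execution plan looks roughly like the following sketch (module paths are from Ray 1.0-era RLlib and may have moved in later releases):

```python
from ray.rllib.execution.rollout_ops import ParallelRollouts, ConcatBatches
from ray.rllib.execution.train_ops import TrainOneStep
from ray.rllib.execution.metric_ops import StandardMetricsReporting


def execution_plan(workers, config):
    # 1. Generate experiences in parallel from the current policy.
    rollouts = ParallelRollouts(workers, mode="bulk_sync")

    # 2. Concatenate rollouts into batches of at least train_batch_size,
    # 3. then apply one step of SGD on the policy; the updated weights are
    #    broadcast back to the rollout workers after each train step.
    train_op = rollouts.combine(
        ConcatBatches(min_batch_size=config["train_batch_size"])
    ).for_each(TrainOneStep(workers))

    # 4. Wrap the plan with the standard metrics reporting shared by all
    #    RLlib algorithms, and repeat until the stopping criteria are met.
    return StandardMetricsReporting(train_op, workers, config)
```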
Keep in mind, this is real production code, with debugging, log statements, metrics reporting, and so on, so it’s a really huge simplification in terms of readability.

I also wanted to give an update on the RLlib community. For Ray, we have a Slack, and there’s an RLlib Slack channel with more than a thousand users; you can join it at rllib.io. We’ve seen steady growth in user engagement on GitHub: users have reported many novel use cases that help guide our roadmap, and there’s been a lot of growth in issues reported about RLlib, which is one measure of user engagement. This graph shows the new issues reported specifically for RLlib per month over the past two years, and as you can see, we’re seeing an accelerating number of issues per month, especially in the past few months. RLlib is also part of several industry RL platforms today, several of which are public, and it’s used internally by many more. Some of the public ones are Amazon SageMaker RL, Azure Bonsai, and SkyMind.

So what’s next for RLlib? Here are some of the top issues raised by users. First, we hear there’s a lot of community interest in frameworks like PyTorch and JAX. Users are also very interested in model-based reinforcement learning: model-based RL is a rapidly advancing field of research right now, and it promises much greater sample efficiency, making reinforcement learning practical for tasks where it’s expensive to collect experiences. Users are interested in complex multi-agent use cases, both for research and for applied work. And as models have advanced in the deep learning field, users want to leverage more powerful models, such as transformers that self-attend across time, or models for handling complex observations.

So what are we doing about this? I’m happy to say that with RLlib 1.0, PyTorch now has 100% parity with RLlib’s TensorFlow support. Actually, we now have more PyTorch algorithms than TensorFlow ones, because it’s simply easier to add new algorithms in PyTorch. For model-based reinforcement learning, RLlib 1.0 has MB-MPO and Dreamer fully implemented and tested. And to support some of these more complex use cases, we’re adding two new APIs: first, the distributed execution DSL I described in the previous slides, which is fully stable and is now the way to write distributed algorithms in RLlib, and second, a new trajectory view API that will allow high-performance models such as transformers and LSTMs to work very seamlessly.

In summary, RLlib is a scalable and unified RL library, with a number of new capabilities in Ray 1.0. If you’re interested in using RLlib or getting involved, you can check out our documentation or Slack at rllib.io, and we’re also hiring for RLlib and Ray development at Anyscale. Yeah, thank you.