Transcript:
[MUSIC PLAYING]

SERGIO GUADARRAMA: So my name is Sergio Guadarrama. I'm a senior software engineer at Google Brain, and I'm leading the TF-Agents team.

EUGENE BREVDO: And I'm Eugene Brevdo, a software engineer on the Google Brain team. I work on reinforcement learning.

SERGIO GUADARRAMA: So today we're going to talk about reinforcement learning. How many of you remember how you learned to walk? You stumble a little bit. You try one step, it doesn't work, you lose your balance, you try again. When you're trying to learn something that hard, you need a lot of practice; you need to try many times. This cute little robot is basically doing that: just moving its legs. It doesn't coordinate very well, but it's trying to learn how to walk. After trying multiple times, in this case 1,000 times, it learns a little bit how to take its first steps, moving forward a little before falling over. If we let it train a little longer, it's able to actually walk around, go from one place to another, and find its way around the room.

You have probably heard about the applications of reinforcement learning over the last couple of years, including recommender systems, data centers, [INAUDIBLE], real robots like this little robot, and also AlphaGo, which plays Go better than any human.

Now I have a question for you: how many of you have tried to actually implement an RL algorithm? OK, I see quite a few hands. Very good. It's hard. [LAUGHTER] Yeah, we went through that pain too. Many people who try get a first prototype right away, and it seems to be working, but then you're missing a lot of different pieces. All the details have to be right, all the bugs fixed, because it's very unstable, [INAUDIBLE]. There are a lot of pieces, a replay buffer, a lot of things you need to do. We suffered through the same problem at Google, so we decided to implement a library that many people can use. And today the TF-Agents team is very happy to announce that it's available online. You can go to GitHub, you can pip install it, and start using it right away. And hopefully you will provide feedback and contributions so we can make it better over time.

So what is TF-Agents, and what does it provide? We tried to make it a robust, scalable, and easy-to-use reinforcement learning library for TensorFlow. It's easy to debug, easy to try, and easy to get things going. For people who are new to reinforcement learning, we have colabs, documentation, and samples, so you can learn about it. For people who want to solve a real, complex problem, we have state-of-the-art algorithms ready to apply very quickly. For researchers who want to develop new RL algorithms, they don't need to build every single piece; they can build on top of it. We make it well tested and easy to configure, so you can start your experiments right away. We build on top of all the goodies of TensorFlow 2.0 that you just saw today: eager execution to make development and debugging a lot easier, tf.keras to build the networks and models, and tf.function when you want things to go faster. And we make it very modular and extensible, so you can cherry-pick the pieces that you need and extend them as you need. And for those who are not ready for the change yet, we make it compatible with TensorFlow 1.14.
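For reference, here is a minimal sketch of what "pip install it and start using it right away" might look like; the CartPole environment is an illustrative choice, not one from the talk.

```python
# Minimal getting-started sketch. Assumes: pip install tf-agents
# (plus a compatible TensorFlow install). The environment is illustrative.
from tf_agents.environments import suite_gym

env = suite_gym.load('CartPole-v0')   # load a Gym environment as a TF-Agents PyEnvironment
print(env.observation_spec())         # what the environment observes
print(env.action_spec())              # what actions it accepts
```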
So if we go back to the little example of the robot trying to walk, this is, in a nutshell, what the code looks like. You define some networks, in this case an actor distribution network and a critic network, and then an agent, [INAUDIBLE] agent in this case. Then, assuming we have some experience already collected, we can just train on it.

TF-Agents provides a lot of RL algorithms and RL environments already, like [INAUDIBLE], Atari, Mujoco, PyBullet, and DM Control, and maybe yours soon. We also provide state-of-the-art algorithms, including DQN, TD3, PPO, [INAUDIBLE], and many others, with more coming soon and hopefully more from the community. They are fully tested, with quality, regression, and speed tests to keep things working.

As an overview, the system looks like this. On the left side you have all the collection aspects: some policy interacts with the environment and collects experience, which we probably put in a replay buffer to use later. On the right side we have the training pipeline, where we read from this experience and the agent learns to improve the policy by training a neural network.

Let's focus for a little bit on the environment. How do we define a problem, a new task? Let's take another example, in this case Breakout. The idea is that you have to play this game: move the paddle left and right and try to break the bricks at the top. If you break the bricks, you get rewards, so the points go up. If you let the ball drop, the points go down. So the agent receives some observation from the environment, in this case multiple frames, decides which action to take, and then, based on that, gets some reward, and the loop repeats.

In code this looks something like the following. You define the observation spec: what kind of observations does this environment provide? In this case frames, but it could be any tensor, any other information, multiple cameras, multiple things. Then the action spec: what actions can I take in this environment? In this case only left and right, but many other environments have more options. Then a reset method, because we're going to play this game a lot of times, so we have to reset the environment; and then a step method: taking an action produces a new observation and gives us a reward.

Given that, we could define a policy by hand, for example, and start playing this game. You just create an environment, define your policy, reset the environment, and start looping over it, playing the game. If your policy is very good, you will get a good score. To make learning even faster, we provide parallel environments, so you can run these games in parallel multiple times and wrap them in TensorFlow so it goes even faster, and then run the same loop again. In general, though, we don't want to define these policies by hand. So let me hand it over to Eugene, who is going to explain how to learn those policies.
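Here is a hedged sketch of that interaction loop, with parallel environments wrapped for TensorFlow. The environment name, the number of parallel copies, and the use of a random policy as a stand-in for a hand-written one are all illustrative assumptions, not the code from the slides.

```python
import tensorflow as tf
from tf_agents.environments import suite_gym, tf_py_environment, parallel_py_environment
from tf_agents.policies import random_tf_policy

# Run several copies of the game in parallel, then wrap them so the loop runs in TensorFlow.
py_env = parallel_py_environment.ParallelPyEnvironment(
    [lambda: suite_gym.load('CartPole-v0')] * 4)
env = tf_py_environment.TFPyEnvironment(py_env)

# Stand-in for the hand-written policy: act at random over the environment's action spec.
policy = random_tf_policy.RandomTFPolicy(env.time_step_spec(), env.action_spec())

time_step = env.reset()
total_reward = 0.0
for _ in range(100):
  action_step = policy.action(time_step)     # pick an action from the current observation
  time_step = env.step(action_step.action)   # the environment returns a new observation and reward
  total_reward += float(tf.reduce_sum(time_step.reward))
print('Reward collected:', total_reward)
```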
EUGENE BREVDO: Thank you, Sergio. So yeah, as Sergio said, we've given an example of how you would interact with an environment via a policy. Now we're going to go into a little more detail and talk about how to make policies and how to train policies to maximize the rewards.

So, going over it again: policies take observations and emit a distribution over the actions. In this case the observations are an image, or a stack of images. There is an underlying neural network that converts those images to the parameters of the distribution, and then the policy emits that distribution, or you might sample from it to actually take actions.

So let's talk about networks. I think you've seen some variation of this slide over and over again today. A network, in this case a network used for deep Q-learning, is essentially a container for a bunch of Keras layers. In this case, your inputs go through a convolution layer and so on, and then the final layer emits logits over the number of actions that you might take. The core method of the network is the call: it takes observations and a state, possibly an RNN state, and emits the logits and the new, updated state.

So let's talk about policies. First of all, we provide a large number of policies, some of them specifically tailored to particular algorithms and particular agents, but you can also implement your own, so it's useful to go through that. A policy takes one or more networks. The fundamental method on a policy is the distribution method. It takes the time step, which essentially contains the observation, passes it through one or more networks, and emits the parameters of the output distribution, in this case logits. It then returns a tuple of three things. The first is an actual distribution object; Josh Dillon just spoke about TensorFlow Probability, and here is a TensorFlow Probability Categorical distribution built from those logits. It also emits the next state, again possibly containing some RNN state information, and it emits side information. Side information is useful if, for example, you want to emit something to log in your metrics that is not the action, or information that is necessary for training later on; the agent is going to use that information to actually train.

So now let's actually talk about training. The agent class encompasses the main RL algorithm, and that includes the training: reading batches of data, trajectories, to update the neural network. Here's a simple example. First you create a deep Q-learning agent and give it a network. You can access a policy, specifically a collect policy, from that agent. That policy uses the underlying network that you passed in and maybe performs some additional work: maybe some exploration, like epsilon-greedy exploration, and it also logs the side information that is going to be necessary to train the agent. The main method on the agent is called train. It takes experience in the form of batched trajectories; these come, for example, from a replay buffer.

Now, assuming you have trained your networks and you're performing well during data collection, you might also want a policy that takes greedy actions and doesn't explore at all; it just exploits. It takes the actions that it thinks are the best and doesn't emit any side information. That's the deployment policy. You can save it as a SavedModel, for example, and deploy it.

So here is a more complete example. Again, we have a deep Q-learning network. It accepts the observation and action specs from the environment and some other arguments describing what kind of Keras layers to combine. You build the agent with that network.
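A hedged sketch of these network, agent, and policy pieces, using the DQN classes from TF-Agents; the environment, layer sizes, and export path are illustrative assumptions rather than the exact code on the slides.

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.policies import policy_saver

env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

# The network is a container of Keras layers mapping observations to logits over actions.
q_net = q_network.QNetwork(
    env.observation_spec(), env.action_spec(), fc_layer_params=(100,))

# The agent wraps the RL algorithm (here DQN) and owns the training step.
agent = dqn_agent.DqnAgent(
    env.time_step_spec(), env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
agent.initialize()

collect_policy = agent.collect_policy   # explores (e.g. epsilon-greedy) and emits side info needed for training
greedy_policy = agent.policy            # exploits only; the deployment policy

# The deployment policy can be exported as a SavedModel (path is illustrative).
policy_saver.PolicySaver(greedy_policy).save('exported_policy')
```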
And then you get a tf.data dataset. In this case you get it from a replay buffer object, but you can get it from any other dataset that emits the correct form of batched trajectory information. Then you iterate over that dataset, calling agent.train to update the underlying neural networks, which are then reflected in the updated policies.

So let's talk a little bit about collection. Given a collect policy, and it doesn't have to be trained, it can have just random parameters, you want to be able to collect data, and we provide a number of tools for that. Again, if your environment is something that lives in Python, you can wrap it. The core tool for this is the driver. Going through it: first you create your batched environments at the top. Then you create a replay buffer; in this case we have a TFUniformReplayBuffer, a replay buffer backed by TensorFlow variables. And then you create the driver. The driver accepts the environment, the collect policy from the agent, and a number of callbacks, the observers. When you call driver.run, it will iterate; in this case it will take 100 steps of interaction between the policy and the environment, create trajectories, and pass them to the observers. So after driver.run has finished, your replay buffer has been populated with a hundred more frames of data.

So here's the complete picture. You create your environment. You interact with that environment through the driver, given a policy. Those interactions get stored in the replay buffer. You read from the replay buffer with a tf.data dataset, and then the agent trains with batches from that dataset and updates the network underlying the policy. Here's a set of commands to do that. If you look at the bottom, there's the loop: you call driver.run to collect data, which is stored in the replay buffer, and then you read from the dataset generated from that replay buffer and train the agent. You can iterate this over and over again.

So we have a lot of exciting things coming up. For example, we have a number of new agents that we're going to release, C51, D4PG, and so on. We're adding complete support for contextual bandits backed by neural networks to the API. We're going to release a number of baselines, as well as a number of new replay buffers. In particular, we're going to be releasing some distributed replay buffers in the next couple of quarters, and those will be used for distributed collection. Distributed collection allows you to parallelize your data collection across many machines and maximize the throughput of training your RL algorithm. We're also working on distributed training using TensorFlow's new distribution strategy API, allowing you to train at massive scale on many GPUs and TPUs, and we're adding support for more environments.

So please check out TF-Agents on GitHub. We have a number of colabs, I think eight or nine as of this count, exploring different parts of the system. And as Sergio said, TF-Agents is built to solve many real-world problems, and in particular we're interested in seeing what your problems are. For example, we welcome contributions of new environments and new RL algorithms, for those of you out there who are RL experts. Please come chat with me or Sergio after the talks, or file an issue on the GitHub issue tracker and let us know what you think. Thank you very much.

[MUSIC PLAYING]
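As a recap of the complete picture described above (environment, driver, replay buffer, dataset, agent.train), here is a hedged, self-contained sketch; the environment, buffer size, batch size, and number of iterations are illustrative assumptions, not the commands from the slides.

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer

# Environment, network, and agent (as in the earlier sketch).
env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))
q_net = q_network.QNetwork(env.observation_spec(), env.action_spec())
agent = dqn_agent.DqnAgent(
    env.time_step_spec(), env.action_spec(),
    q_network=q_net, optimizer=tf.keras.optimizers.Adam(1e-3))
agent.initialize()

# A replay buffer backed by TensorFlow variables.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=env.batch_size,
    max_length=10000)

# The driver runs the collect policy in the environment for 100 steps per call
# and passes the resulting trajectories to its observers (here, the buffer).
driver = dynamic_step_driver.DynamicStepDriver(
    env, agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=100)

# Read training batches back out of the buffer as a tf.data dataset.
dataset = replay_buffer.as_dataset(
    sample_batch_size=64, num_steps=2, num_parallel_calls=3).prefetch(3)
iterator = iter(dataset)

# Alternate collection and training.
for _ in range(1000):
  driver.run()                          # collect 100 more steps into the buffer
  experience, _ = next(iterator)        # a batch of trajectories
  loss_info = agent.train(experience)   # update the networks behind the policies
```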