From the amazing results on vintage Atari games and DeepMind's stunning victory with AlphaGo, to breakthroughs in robotic arm manipulation and even beating professional players at 1v1 Dota, the field of reinforcement learning has exploded in recent years. Ever since the impressive breakthrough on the ImageNet classification challenge in 2012, the successes of supervised deep learning have continued to pile up, and people from many different backgrounds have started using deep neural nets to solve a wide range of new tasks, including how to learn intelligent behavior in complex, dynamic environments. So in this episode, I will give a general introduction to the field of reinforcement learning, as well as an overview of the most challenging problems that we're facing today. If you're looking for a solid introduction to deep reinforcement learning, this episode is exactly what you're looking for. My name is Xander, and welcome to Arxiv Insights.

In 2017, Pieter Abbeel gave a very inspiring demo in front of a large audience of some of the brightest minds in AI and machine learning. He showed a video where a robot is cleaning a living room, bringing somebody a bottle of beer, and doing a whole range of mundane tasks that robots in sci-fi movies do without question. Then, at the end of the video, Pieter revealed that the robot's actions were actually entirely remote-controlled by a human operator. The takeaway from this demo, I think, is a very important one: the robots we've been building for decades now are physically perfectly capable of doing a wide range of useful tasks, but the problem is that we can't embed them with the intelligence needed to do those things. Creating useful state-of-the-art robotics is a software challenge, not a hardware problem. And it turns out that having a robot learn to do something very simple, like picking up
a bottle of beer, can be a very challenging task. So in this video, I want to introduce you to the whole subfield of machine learning called reinforcement learning, which I think is one of the most promising directions toward truly intelligent robotic behavior.

In the most common machine learning applications, people use what we call supervised learning. This means you give an input to your neural network model, and because you know the output the model should produce, you can compute gradients, using something like the backpropagation algorithm, to train the network to produce those outputs. So imagine you want to train a neural network to play the game of Pong. In a supervised setting, you would have a good human gamer play Pong for a couple of hours, and you would create a dataset where you log all of the frames that the human sees on the screen, as well as the actions taken in response to those frames, that is, pushing the up arrow or the down arrow. We can then feed those input frames through a very simple neural network that produces one of two actions at its output: it either selects up or down. By simply training on the dataset of human gameplay, using something like backpropagation, we can train that neural network to replicate the actions of the human gamer. But there are two significant downsides to this approach. On the one hand, supervised learning requires creating a dataset to train on, which is not always an easy thing to do. On the other hand, you can train your neural network model to simply imitate the actions of the human player.
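The supervised setup just described can be sketched in a few lines of NumPy. Everything here is a toy stand-in: the "frames" are random vectors, the "human" is a made-up rule that presses up whenever the first pixel feature is positive, and the network is a single linear layer rather than a real model.

```python
import numpy as np

# Toy sketch of imitation learning for Pong (all data is made up):
# frames are flattened pixel vectors, and the stand-in "human"
# presses UP whenever the first feature is positive.
rng = np.random.default_rng(0)
n_pixels, n_frames = 100, 500
frames = rng.standard_normal((n_frames, n_pixels))
actions = (frames[:, 0] > 0).astype(int)        # 0 = DOWN, 1 = UP

W = np.zeros((n_pixels, 2))                     # a tiny linear "network"

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

for _ in range(200):
    probs = softmax(frames @ W)                 # predicted action probabilities
    grad = probs.copy()
    grad[np.arange(n_frames), actions] -= 1.0   # cross-entropy gradient
    W -= 0.05 * (frames.T @ grad) / n_frames    # one gradient-descent step

accuracy = (softmax(frames @ W).argmax(axis=1) == actions).mean()
```

After a few hundred gradient steps the network reproduces the logged up/down choices on most frames; note that nothing here ever touches the game engine, only the recorded dataset.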
Well, then, by definition, your agent can never be better at playing Pong than that human gamer. For example, if you want to train a neural net to be better at the game of Go than the best human, then by definition we can't use supervised learning. So is there a way to have an agent learn to play a game entirely by itself? Fortunately there is, and this is called reinforcement learning.

The framework in reinforcement learning is actually surprisingly similar to the normal framework in supervised learning. We still have an input frame, we run it through some neural network model, and the network produces an output action, either up or down. The only difference is that now we don't actually know the target label: we don't know in any situation whether we should have gone up or down, because we don't have a dataset to train on. In reinforcement learning, the network that transforms input frames into output actions is called the policy network.

Now, one of the simplest ways to train a policy network is a method called policy gradients. The approach in policy gradients is that you start out with a completely random network. You feed that network a frame from the game engine, it produces a random output action, up or down, you send that action back to the game engine, the game engine produces the next frame, and this is how the loop continues. The network in this case could be a fully connected network, but you can obviously apply convolutions there as well. In reality, the output of your network is going to consist of two numbers: the probability of going up and the probability of going down. While training, you actually sample from that distribution, so you're not always repeating the same exact actions. This allows your agent to explore the environment somewhat randomly and hopefully discover better rewards and better behavior.
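A minimal sketch of that policy network and the sampling step might look like this; the layer sizes and random weights are arbitrary placeholders, not a real trained model.

```python
import numpy as np

# Hypothetical Pong policy network: one hidden layer, two outputs.
rng = np.random.default_rng(42)
n_pixels, n_hidden = 100, 32

W1 = rng.standard_normal((n_pixels, n_hidden)) * 0.1  # input -> hidden
W2 = rng.standard_normal((n_hidden, 2)) * 0.1         # hidden -> 2 logits

def policy(frame):
    h = np.maximum(0.0, frame @ W1)   # ReLU hidden activations
    logits = h @ W2
    e = np.exp(logits - logits.max())
    return e / e.sum()                # softmax: [P(up), P(down)]

frame = rng.standard_normal(n_pixels) # stand-in for a game frame
probs = policy(frame)
action = rng.choice(2, p=probs)       # sample, don't argmax: sampling
                                      # is what keeps the agent exploring
```

Taking the argmax instead of sampling would make the agent deterministic and kill exploration, which is why the sampling step matters during training.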
Now, importantly, because we want to enable our agent to learn entirely by itself, the only feedback we're going to give it is the scoreboard in the game. Whenever our agent manages to score a goal, it receives a reward of +1, and whenever the opponent scores, our agent receives a penalty of -1. The entire goal of the agent is to optimize its policy to receive as much reward as possible.

So in order to train our policy network, the first thing we're going to do is collect a bunch of experience: run a whole bunch of game frames through the network, select random actions, feed them back into the engine, and just create a whole bunch of random Pong games. Now obviously, since our agent hasn't learned anything useful yet, it's going to lose most of those games. But the thing is that sometimes our agent might get lucky: sometimes it randomly selects a whole sequence of actions that actually leads to scoring a goal, and in that case it receives a reward. A key thing to understand is that for every episode, regardless of whether we got a positive or a negative reward, we can already compute the gradients that would make the actions our agent chose more likely in the future, and this is very crucial. What policy gradients do is that for every episode where we got a positive reward, we use the normal gradients to increase the probability of those actions in the future.
But whenever we got a negative reward, we apply the same gradient multiplied by minus one, and this minus sign makes sure that all the actions we took in a very bad episode become less likely in the future. The result is that while training our policy network, the actions that lead to negative rewards are slowly filtered out, while the actions that lead to positive rewards become more and more likely. So in a sense, our agent is learning how to play Pong.

Now, I know this was a very quick introduction to reinforcement learning, so if you want to spend a little more time thinking about the details, I really recommend reading Andrej Karpathy's blog post "Pong from Pixels"; it does a phenomenal job of explaining all the details.

All right, so we can use policy gradients to train a neural network to play Pong. That's amazing, right? Well, yes, it is, but as always, there are a few very significant downsides to this method. Let's go back to Pong one more time. Imagine that your agent has been training for a while and is actually doing a pretty decent job at playing Pong, bouncing the ball back and forth, but then at the end of the episode it makes a mistake: it lets the ball through and gets a negative penalty. The problem with policy gradients is that they will assume that, since we lost that episode, all of the actions we took there must have been bad, and they will reduce the likelihood of taking those actions in the future. But remember that for most of that episode we were doing really well, so we don't really want to decrease the likelihood of those actions. In reinforcement learning, this is called the credit assignment problem.
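The plus-or-minus-one trick described above can be sketched directly. This is a toy linear policy on made-up frames, just to show the mechanics: compute the gradient that makes the episode's actions more likely, then scale it by the episode's reward, so a reward of -1 pushes those actions' probabilities down.

```python
import numpy as np

# Sketch of the policy-gradient update described above.
rng = np.random.default_rng(0)
n_pixels, lr = 100, 0.01
W = np.zeros((n_pixels, 2))          # toy linear policy: 0 = UP, 1 = DOWN

def probs(frame):
    z = frame @ W
    e = np.exp(z - z.max())
    return e / e.sum()

def episode_gradient(frames, actions):
    # gradient of sum_t log pi(a_t | s_t) with respect to W:
    # following it as-is makes every chosen action MORE likely
    g = np.zeros_like(W)
    for f, a in zip(frames, actions):
        onehot = np.eye(2)[a]
        g += np.outer(f, onehot - probs(f))
    return g

frames = rng.standard_normal((5, n_pixels))   # a made-up 5-step episode
actions = [int(rng.choice(2, p=probs(f))) for f in frames]
before = [probs(f)[a] for f, a in zip(frames, actions)]

reward = -1.0                                 # say we lost this episode
W += lr * reward * episode_gradient(frames, actions)

after = [probs(f)[a] for f, a in zip(frames, actions)]
# every action taken during the losing episode is now less likely
```

Notice that all five actions share the single scalar reward, good moves and bad moves alike, which is exactly the credit assignment issue just mentioned.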
If you get a reward at the end of your episode, what exactly were the actions that led to that specific reward? This problem is entirely related to the fact that we have what we call a sparse reward setting: instead of getting a reward for every single action, we only get a reward after an entire episode, and our agent needs to figure out which part of its action sequence was causing the reward it eventually gets. In the case of Pong, for example, our agent should learn that only the actions right before it hits the ball are truly important; everything it does once the ball is flying off doesn't really matter for the eventual reward. The result of this sparse reward setting is that reinforcement learning algorithms are typically very sample inefficient, which means you have to give them a ton of training time before they can learn useful behavior. I've made a previous video comparing the sample efficiency of reinforcement learning algorithms with human learning that goes much deeper into why this is the case.

Now, it turns out that in some extreme cases, the sparse reward setting fails completely. A famous example is the game Montezuma's Revenge, where the goal of the agent is to navigate a bunch of ladders, jump over a skull, grab a key, and then navigate to the door in order to get to the next level. The problem is that by taking random actions, your agent is never going to see a single reward, because the sequence of actions it needs to take to get that reward is just too complicated; it's never going to get there with random actions. So your policy gradient never sees a single positive reward and has no idea what to do. The same applies to robotic control, where, for example, you would like to train a robotic arm to pick up an object and stack it on top of another.
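To make the sparse-reward problem concrete, here's a toy experiment of my own (not from the video): an agent must pick the "correct" one of two actions at every one of 15 consecutive steps to reach a reward, a bit like threading past the skull to the key, and a purely random policy almost never manages it.

```python
import numpy as np

# Toy corridor: reward 1 only if the agent picks action 1 ("right")
# at all 15 consecutive steps; one wrong move means zero reward,
# like dying to the skull in Montezuma's Revenge.
rng = np.random.default_rng(0)
n_steps, n_episodes = 15, 10_000

successes = 0
for _ in range(n_episodes):
    if np.all(rng.integers(0, 2, size=n_steps) == 1):  # all "right"?
        successes += 1

success_rate = successes / n_episodes
expected = 0.5 ** n_steps   # about 3 successes per 100,000 episodes
# With essentially zero successful episodes, the policy gradient never
# sees a single positive reward, so it has nothing to learn from.
```

A seven-joint arm with continuous actions has a vastly larger action space than this two-action corridor, so the odds of stumbling onto the reward by chance are even worse.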
The typical robot arm has about seven joints it can move, so it's a relatively high-dimensional action space, and if you only give it a positive reward when it has successfully stacked a block, then by doing random exploration it's never going to get to see any of that reward. And I think it's important to compare this with the traditional supervised deep learning successes we get in something like computer vision. The reason computer vision works so well is that for every single input frame you have a target label, and this lets you do very efficient gradient descent with something like backpropagation, whereas in a reinforcement learning setting you have to deal with this very big problem of sparse rewards. This is why computer vision is showing some very impressive results, while something as simple as stacking one block onto another seems very difficult, even for state-of-the-art deep learning.

The traditional approach to this issue of sparse rewards has been the use of reward shaping. Reward shaping is the process of manually designing a reward function that guides your policy toward some desired behavior. In the case of Montezuma's Revenge, for example, you could give your agent a reward every single time it manages to avoid the skull or reach the key, and these extra rewards will guide your policy toward the desired behavior. While this obviously makes it easier for your policy to converge, there are some significant downsides to reward shaping. Firstly, reward shaping is a custom process that needs to be redone for every new environment you want to train a policy in. If you look at the Atari benchmark, for example, you would have to craft a new reward function for every single one of those games; that's just not scalable. The second problem is that reward shaping suffers from what we call the alignment problem. It turns out that shaping a reward function is actually surprisingly difficult in a lot of cases: your agent will find some very surprising way to make sure it's getting a lot of reward while not doing at all what you wanted it to do. In a sense, the policy is just overfitting to the specific reward function you designed, while not generalizing to the intended behavior you had in mind. There are a lot of funny cases where reward shaping goes terribly wrong. Here, for example, an agent was trained to jump, and the reward function was the distance from its feet to the ground; what the agent learned was to simply grow a very tall body and do some kind of backflip to make sure its feet are very far from the ground. To give you one final idea of how hard reward shaping can be, look at this shaped reward function for a robotic control task; I don't even want to know how long the people behind this paper spent designing that specific reward function to get the behavior they wanted. And finally, in some cases like AlphaGo, by definition you don't want to do any reward shaping, because it would constrain your policy to the behavior of humans, which is not exactly optimal in every situation. So the situation we're in right now is that we know it's really hard to train in a sparse reward setting, but at the same time it's also very tricky to shape a reward function, and we don't always want to do that. And to end this video, I would like to note that a lot of media stories picture reinforcement learning as some kind of magical
AI sauce that lets an agent learn by itself or improve upon its previous version, but the reality is that most of these breakthroughs are actually the work of some of the brightest minds alive today, and there's a lot of very hard engineering going on behind the scenes. I think one of the biggest challenges in navigating our digital landscape is discerning truth from fiction in this ocean of clickbait powered by the advertisement industry, and I think the Atlas robot from Boston Dynamics is a very clear example of what I mean. If you go out on the streets and ask a thousand people what the most advanced robots today are, they would probably point to Atlas from Boston Dynamics, because everybody has seen the video where it does a backflip. But the reality is that if you think about what Boston Dynamics is actually doing, it's very likely that there's not a lot of deep learning going on there. If you look at their previous papers and research track record, they're doing a lot of very advanced robotics, don't get me wrong, but there's not a lot of self-driven behavior, not a lot of intelligent decision-making going on in those robots. So don't get me wrong, Boston Dynamics is a very impressive robotics company, but the media image they've created might be a little bit confusing to a lot of people who don't know what's going on behind the scenes. Nonetheless, if you look at the progress of research that is going on, I think we should not be negligent of the potential risks these technologies can bring.
So I think it's very good that a lot more people are getting involved in AI safety research, because this is going to become very fundamental: threats like autonomous weapons and mass surveillance are to be taken very seriously, and the best hope we have is that international law will be somewhat able to keep up with the rapid progress we see in technology. On the other hand, I also feel the media is focusing way too much on the negative side of these technologies, simply because people fear what they don't understand, and fear sells more advertisement than utopias. I personally believe that most, if not all, technological progress is beneficial in the long run, as long as we can make sure there are no monopolies that can maintain or enforce their power through the malignant use of AI. Well, anyway, enough politics for one video. This video was an introduction to deep reinforcement learning and an overview of the most challenging problems we're facing in the field. In the next video, I will dive into some of the most recent approaches that try to tackle these problems of sample efficiency and the sparse reward setting; specifically, I will cover a few technical papers dealing with approaches like auxiliary rewards, intrinsic curiosity, hindsight experience replay, and so on. I've also seen that a few people have chosen to support me on Patreon, for which I would just like to say thank you very much; it really means a big deal to me. I'm making these videos completely in my spare time, and knowing that there are people out there who appreciate this content really feels great. So thank you very much, thanks for watching, don't forget to subscribe, and I'd love to see you again in the next episode of Arxiv Insights.