Transcript:

Hello, wizards! This is Colin Scow representing the Las Vegas chapter of the School of AI, and we’re going to learn about one of the most fundamental and important mathematical equations in the world of reinforcement learning: the Bellman equation. For those of you who don’t know me, I’m a professional software developer and AI entrepreneur. I barely studied any math, but I’ve managed to become productive in solving real-world problems with neural networks. Anyway, if I can figure it out, you can too. So my goal in helping Suraj with this course is to make the material as accessible as possible to other non-mathies like myself. The topic of the Bellman equation is so important that if you don’t develop a very good intuition of how this works early on, you’re definitely going to feel lost and confused later in the course. Okay, so before we move on, I want to give you a few important study tips to make sure you’re fully able to master this material. Whenever you’re learning a new subject, not just reinforcement learning, it’s important that you get very comfortable with the nomenclature, which means the vocabulary of the subject. Whenever a new term or concept is introduced, pause to make sure you fully comprehend it. Research other sources, draw diagrams, turn equations into computer code; do whatever it takes so that when you hear the word, you instantly understand what it means without having to think. If you go past terminology you don’t fully understand, you won’t learn much. Studying to fully comprehend and be able to apply what you learn in the real world takes patience and discipline. This is how I’m able to learn advanced topics quite quickly, even with no previous experience. All right, so before we get into the Bellman equation, I want to quickly review the terminology.
You’ll need to understand these terms:

- State: a numeric representation of what our agent observes in the game world at a particular point in time. This is often raw pixels on the screen.
- Action: the input we provide to the gaming environment, calculated by applying a policy to the current state. This is basically a combination of control buttons being pressed at any given time, or, say, analog input from a joystick.
- Reward: a feedback signal from the environment reflecting how well the agent is performing at the goals of the game, such as coins collected, enemies killed, points from reaching checkpoints in the level, and so on.

All right, so to review: what is it that we’re trying to accomplish in reinforcement learning? Our goal is, given the current state we’re in, to choose the optimal action which will maximize the long-term expected reward provided by the environment. Now, with a solid understanding of the basics, it should be fairly simple to understand the Bellman equation. In fact, the roots of reinforcement learning actually go all the way back to the work of Dr. Richard Bellman in 1954. Dr. Bellman is also known as the father of dynamic programming, and the Bellman equations are actually examples of dynamic programming. Even though we’re going to cover this in a lot more detail in the next unit, it’s important that you have a 50,000-foot overview of what dynamic programming is. It’s a class of algorithms which seek to simplify complex problems by breaking them into smaller subproblems and solving the subproblems recursively; by recursively, I mean a function that calls itself over and over again until it comes up with the right solution. All right, so what question exactly does the Bellman equation help us answer? What are we solving for? Given the state I’m in, assuming that I take the best possible action now and continue to do so in each subsequent step,
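Before answering that, a quick side note: the vocabulary above (state, action, reward, next state) is often bundled into a single transition record in code. Here's a minimal sketch of my own, not code from the course; the `Transition` name and its fields are hypothetical:

```python
from dataclasses import dataclass

# A hypothetical container for one step of experience, tying together the
# vocabulary above: state, action, reward, and the next state (s').
@dataclass
class Transition:
    state: tuple       # what the agent observes, e.g. a grid position or pixels
    action: str        # the input we chose, e.g. a button press
    reward: float      # feedback signal from the environment
    next_state: tuple  # the state we land in after taking the action

# One step: from square (3, 0) we press "up", get no reward, land in (2, 0).
step = Transition(state=(3, 0), action="up", reward=0.0, next_state=(2, 0))
```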
what long-term reward can I expect? Another way to put this is: what is the value of a state? Value is the key quantity the Bellman equation is solving for. Obviously, not all states are of equal value. If I’m low on health, out of ammo, and surrounded by superior enemies, even taking the best action may not dig me out of the hole. On the other hand, for any of you who have played Quake: if I’ve got a lightning gun, quad damage, and mega health, we can expect a very high value for that state, assuming that I choose the optimal actions. All right, so why is this important? When our AI is deciding on an optimal policy, we want it making decisions based on the best it can do given the state it’s in. So if a robot falls over and manages to get up, the getting back up is something we want to reinforce, even though the reward for having fallen over may be lower, even negative. In other words, the Bellman equation helps us evaluate the expected reward relative to the advantage or disadvantage of each state we find ourselves in. All right, so now let’s look at what the most basic Bellman equation looks like; this is for deterministic environments. The value of a given state is: V(s) = max over actions a of [ R(s, a) + gamma * V(s') ]. Here we have max action; max action means that of all the actions available in the state we’re in, we pick the action which is going to maximize the value. So we take the reward of the optimal action a in state s, and we add to that a term with a multiplier of gamma. That’s the discount factor, which diminishes our reward over time. Every time we take an action, we get to the next state, which is s prime. And out here, the V function: this is where dynamic programming comes in, because it’s recursive. We take an action in the state we’re in, we get a reward back, and we get s
prime back. So now we take that s prime, put it back into the value function, and continue until we hit the terminal state and the episode is over. From there, we know the value of the state we’re in, assuming we choose the optimal action. The catch is that at each step, we have to know what the optimal action is, which will maximize the expected long-term discounted reward. In the days before the deep learning revolution, this was traditionally done by trying out all possible actions in a gigantic search tree. A good example is when Deep Blue beat Kasparov in 1997, but brute force can quickly get out of hand and isn’t very practical for complex tasks like playing video games from raw pixels. That’s where deep neural networks come to the rescue: instead of brute-forcing every possible action in every possible state, the network can simply estimate the value of the state and intelligently guess which action will maximize it. You’ll learn a lot more about that later. All right, so for now, let’s look at a practical real-world example of brute-forcing the Bellman equation to avoid lava pits and rescue the princess. To start out, we’ll use the simplest possible scenario and assume our actions are completely deterministic. That is, if we’re here at the starting point and we choose "move up" as our action, 100% of the time we’ll end up in the state right above where we started. So this is our starting square here, and our goal is to come over here. If we rescue the princess here, we get a reward of 1, and if we unfortunately end up here and fall in the lava pit, we get a reward of negative 1, or I guess we’d better call that a punishment. Since our moves are deterministic, though, we’d have to be pretty stupid to end up there, so it’s unlikely that any decent agent would fall into that trap.
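Before moving on, the recursive value computation described above can be sketched in a few lines of Python. This is my own minimal illustration, not code from the course; `actions`, `reward`, `step`, and `terminal` are hypothetical helpers defining a tiny deterministic chain of states 0 → 1 → 2 → 3, with a reward of 1 for stepping into the terminal state 3:

```python
GAMMA = 0.9  # discount factor

# Deterministic Bellman equation, solved recursively:
#   V(s) = max over actions a of [ R(s, a) + GAMMA * V(s') ]
def value(s, actions, reward, step, terminal, gamma=GAMMA):
    if terminal(s):
        return 0.0  # episode over: no future reward
    return max(
        reward(s, a) + gamma * value(step(s, a), actions, reward, step, terminal, gamma)
        for a in actions(s)
    )

# Tiny chain MDP: states 0..3; stepping from state 2 into state 3 pays 1.
actions  = lambda s: ["right"]                    # only one move available
reward   = lambda s, a: 1.0 if s == 2 else 0.0    # +1 for entering the goal
step     = lambda s, a: s + 1                     # deterministic transition
terminal = lambda s: s == 3                       # episode ends at state 3

print(value(0, actions, reward, step, terminal))  # 0.9 * 0.9 * 1 ≈ 0.81
```

Notice how the recursion bottoms out at the terminal state, exactly as described: each call returns the reward plus the discounted value of s prime.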
I’m going to show you why the discount factor gamma is so important. Let’s say gamma is 1, which essentially means there’s no discounting whatsoever. We have a problem with sparse rewards, because our only reward is here, we have a punishment here, and every other square is zero. So with no discount, you can see all the states end up with equal values of 1, and you can basically see how we’d end up wandering around aimlessly until we accidentally stumble onto the goal. So, a little bit about gamma: gamma is one of those hyperparameters that’s super important to tune to get optimal results. Successful values are usually between 0.9 and 0.99. A lower value tends to encourage short-term thinking, while a higher value emphasizes long-term rewards. All right, so now let’s go through a practical application of the Bellman equation with a discount factor less than 1. Generally in reinforcement learning, we’ll play through an entire episode, storing our state, action, reward, and s prime transitions in a list. Then we work backwards: starting from the end, we take the reward obtained from the current action and add the calculated value of the next state multiplied by the discount factor. Watch exactly how this works; we’re going to use a discount factor of 0.9. Since our equation assumes we’re taking an ideal action, it’s usually best to work backwards from the end of the episode. So we’re in this square here, right next to the goal. The optimal action is to go right into the goal, which will give us a reward of 1. So we take our reward of 1 plus the discount factor 0.9 multiplied by the value of the next state; since this is a terminal state, there is no future value, so that’s zero. So the value of this square is 1. All right, now let’s work backwards from there.
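As an aside, this whole backward pass can be sketched in code. This is my own illustration with a made-up optimal path, not the exact grid from the video: we store the (state, action, reward) transitions from one episode, then walk them in reverse, each time adding the reward to the discount factor times the value of the next state:

```python
GAMMA = 0.9

# Hypothetical episode: an optimal path that ends by stepping into the goal
# square for a reward of +1; every other step has reward 0.
episode = [
    ((2, 0), "up",    0.0),
    ((1, 0), "up",    0.0),
    ((0, 0), "right", 0.0),
    ((0, 1), "right", 0.0),
    ((0, 2), "right", 1.0),  # stepping into the goal pays +1
]

values = {}
future = 0.0  # the terminal state has no future value
for state, action, reward in reversed(episode):
    future = reward + GAMMA * future  # reward + gamma * V(next state)
    values[state] = future

print(values[(0, 2)])  # 1.0: the square right next to the goal
print(values[(2, 0)])  # 0.9**4 ≈ 0.6561: the starting square
```

Each square's value is just the goal reward discounted once more for every extra step it takes to get there.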
We take our reward of 0, because going into this state the reward is 0; the only reward and punishment are over here, and everything else is 0. So 0 plus 0.9 times the value of this square, which is 1, gives us a value of 0.9. Basically, we just continue working backward, multiplying the discount factor by the previous value, until we get to the starting square. And notice there are actually two possible paths with equal values. All right, so what about the policy? Once we calculate the values, the policy is basically: take the action that’s going to bring us into the state with the highest value. So it’s easy to see we have a very clear path here, or we could follow around this path. Now, keep in mind: the equation assumes that we’re choosing the optimal action in each state, and we haven’t yet learned how to figure that out. Throughout this course, you’re going to learn various techniques to estimate the optimal action given imperfect information. The Bellman equation is the basic premise that everything else builds on, so watch this video as many times as you need until you fully understand it. You may also want to research other videos and papers on the subject to give you a deeper understanding. Investing time in understanding the basic concepts of reinforcement learning is well worth it, because when we get to the more advanced stuff, it means you’re not going to have trouble. In this simple example, we’ve assumed that taking action a from state s will always have the same outcome. All right, next we’re going to look at what happens when we introduce some randomness: what happens if we get Mario nice and happy, so he randomly stumbles twenty percent of the time when we tell him to move? I’ll see you back here soon, where I’ll expand on the Bellman equation even further to handle this.