In this video, I’m going to tell you everything you need to know to start solving reinforcement learning problems with policy gradient methods. I’m going to give you the algorithm and the implementation details up front, and then we’ll get into how it all works and why you would want to do it. Let’s get to it.

So here’s the basic idea behind policy gradient methods. A policy is just a probability distribution the agent uses to pick actions, so we use a deep neural network to approximate the agent’s policy. The network takes observations of the environment as input and outputs actions selected according to a softmax activation function. Next, generate an episode and keep track of the states, actions, and rewards in the agent’s memory. At the end of each episode, go back through those states, actions, and rewards and compute the discounted future returns at each time step. Use those returns as weights and the actions the agent took as labels to perform backpropagation and update the weights of your deep neural network. Then just repeat until you have a kick-ass agent. Simple, yeah?

So now that we know the what, let’s unpack how all this works and why it’s something worth doing. Remember, with reinforcement learning we’re trying to maximize the agent’s performance over time. Let’s say the agent’s performance is characterized by some function J, and it’s a function of the weights theta of the deep neural network. So our update rule for theta is that the new theta equals the old theta plus some learning rate times the gradient of that performance metric. Note that we want to increase performance over time, so this is technically gradient ascent instead of gradient descent. The gradient of this performance metric is going to be proportional to a sum over states for the amount of time we spend in any given state, and a sum over actions for the value of the state-action pairs times the gradient of the policy.
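The "compute the discounted future returns at each time step" part of the recipe can be sketched in a few lines. This is a minimal NumPy version (the function name and signature are my own, not from the video); it walks the reward list backwards so each return is built from the one after it:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for each t."""
    G = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```

These per-step returns are what get used as the weights on each action when you do the backpropagation step described above.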
Where, of course, the policy is just the probability of taking each action given we’re in some state. This is really an expectation value, and after a little manipulation we arrive at the following expression. When you plug that into the update rule for theta, you get this other expression. There are two important features here: this G sub t term is the discounted future return we referenced in the opening, and this gradient of the policy divided by the policy is a vector that tells us the direction in policy space that maximizes the chance that we repeat the action a sub t. When you multiply the two, you get a vector that increases the probability of taking actions with high expected future returns. This is precisely how the agent learns over time, and it’s what makes policy gradient methods so powerful. This is called the REINFORCE algorithm, by the way.

If we think about this long enough, though, some problems start to appear. For one, it doesn’t seem very sample efficient. At the top of each episode we reset the agent’s memory, so it effectively discards all its previous experience, aside from the new weights that parameterize its policy. It’s kind of starting from scratch after every time it learns. Worse yet, if the agent has some finite probability of selecting any action in any given state, how can we control the variation between episodes? For large state spaces there are way too many combinations to consider. Well, that’s actually a non-trivial problem with policy gradient methods, and part of the reason our agent wasn’t so great at Space Invaders. Obviously, no reinforcement learning method is going to be perfect, and we’ll get to the solution to both of these problems here in a minute, but first let’s talk about why we would want to use policy gradients at all, given these shortcomings. The policy gradient method is a pretty different approach to reinforcement learning. Many reinforcement learning algorithms, like deep
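To make the update concrete, here’s a minimal REINFORCE step for a linear softmax policy. The linear parameterization and all the names here are my own illustration (the video uses a deep network), but the update is the same idea: move theta in the direction of G times the gradient of log pi for the action that was taken, since grad(pi)/pi equals grad(log pi):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # subtract max for numerical stability
    return z / z.sum()

def reinforce_step(theta, state, action, G, lr=0.01):
    """One REINFORCE update: theta += lr * G * grad log pi(action|state).

    theta: (n_actions, n_features) weights of a linear softmax policy.
    """
    probs = softmax(theta @ state)
    # grad of log pi(a|s) w.r.t. row b of theta is (1[b==a] - p_b) * state
    grad_log_pi = -np.outer(probs, state)
    grad_log_pi[action] += state
    return theta + lr * G * grad_log_pi
```

With a positive return G, one step of this update raises the probability of repeating the chosen action in that state, which is exactly the "momentum toward good actions" behavior described above.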
Q-learning, for instance, rely on estimating the value of each state or state-action pair. In other words, the agent wants to know how valuable each state is so that its epsilon-greedy policy can let it select the action that leads to the most valuable states. The agent repeats this process over and over, occasionally choosing random actions to see if it’s missing something. The intuition behind epsilon-greedy action selection is really straightforward: figure out what the best action is and take it, and sometimes do other stuff to make sure you’re not wildly wrong. Okay, that makes sense. But this assumes that you can accurately learn the action-value function to begin with. In many cases, the value or action-value function is incredibly complex and really difficult to learn on realistic time scales. In some cases, the optimal policy itself may be much simpler and therefore easier to approximate. This means the policy gradient agent can learn to beat certain environments much more quickly than if it relied on an algorithm like deep Q-learning.

Another thing that makes policy gradient methods attractive: what if the optimal policy is actually deterministic? In really simple environments with an obvious deterministic policy, like our grid world example, keeping a finite epsilon means that you keep on exploring even after you’ve found the best possible solution. Obviously this is suboptimal. For more complex environments, the optimal policy may very well be deterministic, but perhaps it’s not so obvious and you can’t guess at it beforehand. In that case, one could argue that deep Q-learning would be great, because you can always decrease the exploration factor epsilon over time and allow the agent to settle on a purely greedy strategy. This is certainly true, but how can we know how quickly to decrease
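For contrast with the stochastic softmax policy, epsilon-greedy action selection looks something like this (a minimal sketch, not code from the video):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick the greedy action with probability 1 - epsilon, else explore randomly."""
    if random.random() < epsilon:
        # Explore: uniform random action
        return random.randrange(len(q_values))
    # Exploit: action with the highest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Notice that as long as epsilon stays finite, the agent keeps taking random actions forever, which is the "keeps exploring even after it has found the best solution" problem described above.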
epsilon? The beauty of policy gradients is that even though they are stochastic, they can approach a deterministic policy over time. Actions that are optimal get selected more and more frequently, and this creates a sort of momentum that drives the agent toward that optimal deterministic policy. This really isn’t feasible in action-value algorithms that rely on epsilon-greedy or its variations.

So what about those shortcomings? As we said earlier, there are really big variations between episodes, since each time the agent visits a state it can choose a different action, which leads to radically different future returns. The agent also doesn’t make very good use of its prior experience, since it discards it after each time it learns. While these may seem like showstoppers, they have some pretty straightforward solutions. To deal with the variance between episodes, we want to scale our returns by some baseline. The simplest baseline to use is the average return from the episode, and we can further normalize the G factor by dividing by the standard deviation of those returns. This helps control the variance in the returns, so that we don’t end up with wildly different step sizes when we perform our update to the weights of the deep neural network. Dealing with the sample inefficiency is even easier. While it’s possible to update the weights of the neural net after each episode, nothing says this has to be the case. We can let the agent play a batch of games, so it has a chance to visit a state more than once before we update the weights of our network. This introduces an additional hyperparameter, the batch size for our updates, but the trade-off is that we end up with much faster convergence to a good policy.
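The baseline trick described here, subtract the mean return and divide by the standard deviation, is only a couple of lines (the function name is my own; the guard against zero standard deviation handles episodes where every return is identical):

```python
import numpy as np

def normalize_returns(G):
    """Center returns on the episode mean and scale by the standard deviation."""
    G = np.asarray(G, dtype=float)
    std = G.std()
    # Avoid division by zero when all returns in the episode are equal
    return (G - G.mean()) / (std if std > 0 else 1.0)
```

After this normalization, returns above the episode average become positive weights (reinforcing those actions) and below-average returns become negative, which is what keeps the gradient step sizes from swinging wildly between episodes.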
Now, it may seem obvious, but increasing the batch size is what allowed me to go from no learning at all in Space Invaders with policy gradients to something that actually learns how to improve its gameplay. So that’s policy gradient learning in a nutshell: we’re using a deep neural network to approximate the agent’s policy, and then using gradient ascent to favor actions that result in larger returns. It may be sample inefficient and have issues with scaling the returns, but we can deal with these problems to make policy gradients competitive with other reinforcement learning algorithms, like deep Q-learning. If you’ve made it this far, check out the video where I implement policy gradients in TensorFlow. If you liked the video, make sure to like the video, subscribe, comment down below, and I’ll see you in the next video.