Transcript:
In this video, we're going to cover active reinforcement learning. This extends what we studied in class, which included a high-level discussion of the general types of reinforcement learning as well as a more in-depth look at passive reinforcement learning, and we'll talk about that a little bit in this video too. The outline is that we're going to talk about active reinforcement learning, motivated by first looking back at passive reinforcement learning and why active makes more sense. Then we'll cover active adaptive dynamic programming, or active ADP, and then Q-learning, which is a new view of reinforcement learning compared to what we've seen so far.

So let's review passive learning. In class we examined the scenario where we, as the learning algorithm, get access to recordings of an agent that is running some fixed policy. We just watch this agent try to collect rewards in some environment by running its policy, and after it finishes its runs and we receive all these recordings, we can go in and use the rewards we saw, the state transitions we saw, and the actions we know the policy was trying to execute, and attempt to estimate the utility function. That's the goal. The first, thought-experiment version of this was direct utility estimation, where we treat each time step in the recordings as a sample of what the future rewards will be and directly estimate the utility from those samples. Then we talked about adaptive dynamic programming, which is a fancy name for estimating the transition probabilities and then running policy iteration on top of our estimated rewards and estimated transition probabilities. Finally, we talked about temporal-difference (TD) learning, a strategy that updates our current estimates of the utilities every single time we observe a transition, that is, every time we take an action and end up in another state (possibly the same one). The idea is to compare our current estimate of the utility of the state we are in against the reward plus the estimated utility of the state we transitioned to, treating that as a sample of the expected future reward. If the sample is bigger than our current estimate, we increase the estimate; if it's smaller, we decrease it. Pretty straightforward. All of this is in the book, so you can take a closer look there, or we can talk about it more in class.
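To make that TD utility update concrete, here is a minimal sketch in Python. The names (a utility dictionary U, learning rate alpha, discount gamma) are my own illustration of the update described above, not code shown in the video:

```python
def td_utility_update(U, s, s_next, reward, alpha=0.1, gamma=0.9):
    """Passive TD update: nudge U[s] toward the observed one-step sample
    reward + gamma * U[s_next]."""
    sample = reward + gamma * U.get(s_next, 0.0)          # sample of future reward
    U[s] = U.get(s, 0.0) + alpha * (sample - U.get(s, 0.0))
    return U
```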
But now let's think about the problems with passive reinforcement learning, and there's kind of a big problem. I realized it in class, and I got tripped up on it at the board when we were talking about it, because it really is the major flaw in the passive reinforcement learning framework: we don't have a good way of updating our policy even after we estimate the utilities. Even if we estimated the exact utilities with these schemes, we still don't know the transition probabilities well. With direct utility estimation we don't know the transition probabilities at all, and with ADP we are estimating them, but the estimate isn't very good, because the only transitions we can sample are the ones that happen when we take actions according to the fixed policy the agent is running. That means we have a very limited sample of what the transitions will be for any action that's not in that original policy. This gets fixed a little bit if, after a passive ADP update, we give the agent a new policy, have it play again, collect more data, and repeat. That sort of fixes it, but not really. Think about it this way: we have a pretty good idea that we can estimate the utilities well from these passive recordings of an agent running around in the environment, but we can only expect a good estimate of the transition probabilities for the actions in our policy. For any action that's not in our policy, what is our estimate of the transitions? We have no idea. In grid world it might seem sensible to assume that the randomness is zero, so that whenever you try to move right, you always move right, but we don't actually know that. There is nothing tying the name of the action "move right" to a transition to the state on the right; it's just some abstract action that induces some probability distribution over next states, and in passive learning we have no samples of that distribution.

So in active learning we now have the ability to choose which actions the agent tries, and that gives us the opportunity to try actions that aren't in our policy. Active reinforcement learning is all about the question of exploration versus exploitation. Exploitation is just a marketing word for using the best policy we have right now, and exploration is not using the best policy we have right now. The naive approach to exploration is to choose actions at random, and that actually works fairly well in many situations, though there are other situations where it can be a really bad idea. If you choose actions randomly, you're pretty much guaranteed eventually to explore every single possible transition, so you get good data for all the transitions and can eventually estimate everything. It works a little bit better, though, if you also shrink the probability of taking that random action over time: most of the time you follow the best policy you've figured out so far, so you collect the rewards you've learned are good, but in states you haven't visited often and haven't gathered much data on, you're very likely to take a random action, and once you've visited a state many times, you take the random action less often. That's the general intuition we might want to implement.
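As a rough sketch of that intuition (not code from the video), you could let the probability of a random action decay with the number of visits to a state; the particular decay schedule epsilon0 / (1 + n) below is just an assumption for illustration:

```python
import random

def choose_action(state, legal_actions, greedy_action, visit_counts, epsilon0=1.0):
    """Mostly exploit the current greedy choice, but explore with a probability
    that shrinks as the state is visited more often."""
    n = visit_counts.get(state, 0)
    epsilon = epsilon0 / (1.0 + n)            # exploration probability decays with visits
    if random.random() < epsilon:
        return random.choice(legal_actions)   # explore: take a random action
    return greedy_action                      # exploit: take the best known action
```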
One way to do it is what I've labeled on the screen as a better approach: we always act greedily, using our current estimates of the transition probabilities and utilities to pick the action that gives us the highest expected future reward, but instead of the true utility estimate we use an over-optimistic one. That's the U+ on the screen, and this is also from the textbook, so you can have a look there too. The idea is that U+ takes on values that look a little different from the usual Bellman update: instead of maximizing directly over the expected future utility, we maximize over an exploration function f applied to two things, the expected utility and a count N(s, a), the number of times we've tried action a from state s. So f is a function that tells you how much to trust your current utility estimate. If we don't trust the current estimate, because we haven't tried this action from this state enough times, we use an overestimate; we assume there's going to be some magical, amazing reward if we try this action. Here's a very simple example of such a function, the one the textbook gives: if the count of the number of times we've tried this action from this state is less than some threshold N_e, maybe 10, maybe 100, some reasonable number that says we'll only trust our estimate once we've seen N_e samples of this state-action pair, then we return R+, an optimistic reward; as the AI engineer, you choose a value that overestimates whatever rewards you could possibly expect to get. Otherwise, we return the actual utility value, the usual Bellman expected reward.
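Here is a sketch of that exploration function and how it might plug into the optimistic U+ backup described above. The values R_plus and N_e, and the dictionaries T (estimated transition probabilities) and N (state-action counts), are assumed placeholders rather than anything fixed in the video:

```python
def exploration_f(u, n, N_e=10, R_plus=100.0):
    """Textbook-style exploration function: be wildly optimistic until the
    (state, action) pair has been tried at least N_e times."""
    return R_plus if n < N_e else u

def optimistic_backup(s, actions, R, T, U_plus, N, gamma=0.9):
    """One optimistic Bellman-style backup for U_plus[s], using exploration_f
    in place of the plain expected utility of each action."""
    return R[s] + gamma * max(
        exploration_f(sum(p * U_plus[s2] for s2, p in T[(s, a)].items()), N[(s, a)])
        for a in actions
    )
```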
Another variant of an active reinforcement learning algorithm is active TD learning, and the slightly weird thing about it is that the TD value update is exactly the same as the non-active TD learning update. That happens basically because it doesn't matter which action was taken. The difference is that instead of computing the expected utility for a fixed policy, we are now targeting the original Bellman utility, the expected reward under the best policy. That's annotated on the screen: I've dropped the pi notation, so the pi superscript is no longer in the equation. The update says essentially the same thing as before: we treat what we observe as an estimate of the reward plus the discounted future reward assuming optimal play, and since we attempted to play optimally, what we observe here should be a good sample of the future reward. If it's bigger than our current estimate, we increase the estimate; if it's smaller, we decrease it. In addition to running these updates, which are really straightforward, we still need to keep estimates of the transition probabilities, so we lose some of the benefit of TD learning. One of the magical things about TD learning is that the utility updates don't include the transition probabilities at all: we let the environment sample the transitions for us, and we never keep track of them, never record them, never even think about them in the updates. But if we want to use these utilities to compute policies in the end, we still need estimates of the transition probabilities.

We can get around that by computing something called an action-utility function, a function that folds the transition probabilities into the utilities. Rather than having a utility for every state plus transition probabilities for the actions within each state, we now have a single score. This action-utility function is traditionally called a Q-function; I'm still looking into why it's called Q, and I'm not really sure, so I apologize that you have to memorize what a Q-function is and what Q-learning is. A Q-function takes both a state and an action, not just a state, and returns a score. The Bellman equation for Q-functions, for action-utility functions, is on the screen now, and it looks very similar to the Bellman equation for utilities; the difference is essentially that the action decision is pushed inside the expectation. Another way to define Q-functions is that the utility of state s is the maximum over all possible actions of Q(s, a), the Q-values of the actions available from state s. The intuition is that rather than the utility of a state, which is the expected reward given optimal play from this point on, the Q-function models what your future rewards will be given that you take some action a from state s; it's a little bit like unrolling one level of the recursion. What that gets us is that we can write a TD update very similar to what we've seen so far. This is the TD learning update for Q-learning on the screen, and it looks pretty much the same as before.
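Here is a minimal sketch of that Q-learning update in Python, with Q stored as a dictionary keyed by (state, action) pairs; the learning rate alpha and discount gamma are assumed placeholder values:

```python
def q_learning_update(Q, s, a, reward, s_next, legal_actions, alpha=0.1, gamma=0.9):
    """Move Q[(s, a)] toward the observed sample
    reward + gamma * max over a' of Q[(s_next, a')]."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in legal_actions)
    sample = reward + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (sample - Q.get((s, a), 0.0))
    return Q
```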
Again, you add or subtract some amount based on whether a sample is close to your current estimate. For comparison, here is the previous utility TD learning update, and it's essentially the same thing. The quantity R(s) plus gamma times the max over a' of Q(s', a') is the sample of what you observe when you make the transition from s to s', and we compare that to our current estimate Q(s, a): if the sample is bigger, we increase Q(s, a), and if it's smaller, we decrease it. It's the same concept, making these updates for each transition we see. Furthermore, now that we have these Q-functions, we no longer need to think about the transition probabilities at all. Originally we always had to think about them when we went to compute our policy; sometimes we don't need them to compute the utility, but to compute the policy we have to think about the expected utility under the transition probabilities. With a Q-function, all we care about is maximizing the Q-function, so we just look at the Q-values for the current state and choose the action a that maximizes them. For that reason, Q-learning is known as model-free reinforcement learning: there's no model of the world, or rather, it's hidden away inside the Q-function.

Finally, the last point I wanted to make in this video is that all of this translates directly to the concept of an approximate utility function that I drew on the board in class. You can do the same thing with Q-learning; you just need to define feature functions, some representation of the current state and action. At the top of the screen is a version of an approximate Q-function, a linear function with coefficients theta and features f. The update equation for TD learning with this Q-function is the second line, and that's the general form for arbitrary Q-functions. In that general form you have a partial derivative, which goes slightly beyond the prerequisites for this course, but you should mostly be familiar with it, maybe from when you studied calculus in high school. For the purpose of simplifying, if we use just the linear form of Q, like I wrote on the first line, then this partial derivative is just the feature value. This is the update you're going to use in your homework for building an approximate Q-learning agent that can play Pac-Man, even though the Pac-Man state space is really large.
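To make the "the derivative is just the feature value" point concrete, here is a sketch of the linear approximate Q-learning update. The function and variable names are my own; the homework's Pac-Man framework has its own API, so treat this only as an illustration of the math:

```python
def linear_q(theta, feats):
    """Linear Q-function: Q(s, a) = sum_i theta[i] * f_i(s, a)."""
    return sum(t * f for t, f in zip(theta, feats))

def linear_q_update(theta, feats_sa, reward, next_feats_per_action,
                    alpha=0.01, gamma=0.9):
    """TD update for the linear case: the gradient of Q with respect to
    theta[i] is just the feature value feats_sa[i], so each weight is
    nudged by alpha * error * feature."""
    best_next = max((linear_q(theta, f) for f in next_feats_per_action), default=0.0)
    error = (reward + gamma * best_next) - linear_q(theta, feats_sa)
    return [t + alpha * error * f for t, f in zip(theta, feats_sa)]
```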
So let's summarize. What we covered in this video are the general concepts surrounding active reinforcement learning. I talked about active adaptive dynamic programming, where we either choose random actions or act greedily while overestimating the utility; both are reasonable things to try, they work fairly well, and they come with some guarantees. Then we talked about active TD learning, which turns out to be exactly the same as standard TD learning, which is kind of a mind bender, but it comes from the fact that in TD learning you don't really care which action you took; you just assume the action came from the optimal policy, or the current policy if you're doing fixed-policy TD learning, and that the transition you observe is a representative sample of the future rewards. Then we talked about Q-learning, where we learn an action-utility function directly rather than treating the utilities and the transition probabilities as separate things; we treat it all as one function, and we can use TD learning to learn it directly without ever having to think about the transition probabilities. Lastly, I finished with the idea that you can use an approximation, and the TD learning update with an approximation uses the derivative of the utility function, which is very easy if it's a linear function: you just take the feature value along that dimension, and that's the derivative. Okay, so let's talk about this stuff in class. There's a lot to go through and a lot to digest. See you in class.