Transcript:

Let's talk about Markov decision processes, or MDPs. An MDP is meant to capture the state of the world, how an agent behaves in it, and how, by acting, the agent gets rewards in that world. So let's say we start with a circle. We say this is s1, so s1 is a state of the world, which includes the agent in it. For example, if your world is a two-dimensional maze with a robot in it, the state of the world would be where the agent is in that maze, its particular location, and then maybe the heading of the agent, and if there's other stuff in the world, you have to add it in. The state needs to capture everything that's important about the world. In each state the agent might get a reward, so let's say the agent gets a reward of 0 in state s1. Then let's say it takes action a, which is, say, move north for a robot or something. He ends up in state s2, where he again gets a reward of 0. From here, say he takes a again and ends up in state s3, and now, finally, he gets a reward of 5. So this is an MDP, right? Well, getting there.
Another thing you can have in an MDP is this. Right here, in s1, we take action a and we definitely end up in s2. You can say, well, really, most of the time, 90% of the time, he ends up in s2, but we're going to say that with probability 0.1, 10% of the time, when he takes action a in s1 he ends up in s4, where he gets, I guess, again 0. So we can do that: we can have probabilities associated with the actions. The thing is, of course, these have to add up to 1. If you are in any state, for any action you take, the arrows coming out of it have to have probabilities that add up to 1: 0.9 plus 0.1 is 1. And, of course, you can take different actions. Let's say he takes another action, b, from that same state, and with probability 0.8 he ends up in s5, where he gets a reward of 2.
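The sum-to-one rule described above can be checked mechanically. Here is a minimal Python sketch, assuming the arrows are stored in a dictionary keyed by (state, action). The names `transitions` and `incomplete_pairs` are made up for illustration, and only the arrows mentioned so far are included.

```python
import math

# Arrows mentioned so far, keyed by (state, action).
# Each value maps a next state to its probability.
transitions = {
    ("s1", "a"): {"s2": 0.9, "s4": 0.1},
    ("s1", "b"): {"s5": 0.8},  # only 0.8 has been assigned so far
}


def incomplete_pairs(transitions):
    """Return the (state, action) pairs whose probabilities do not sum to 1."""
    return [
        pair
        for pair, outcomes in transitions.items()
        if not math.isclose(sum(outcomes.values()), 1.0)
    ]


print(incomplete_pairs(transitions))
```

Action a passes the check, while action b is reported because 0.2 of its probability has not been given an arrow yet.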
There you go. And then we're going to need a 0.2 probability. Let's say with b, with probability 0.2, he stays in the same state. That's fine, you can have loops. You can add more stuff here. Let's say he takes a here and he definitely ends up in s5, that kind of stuff. I mean, it's not finished, but you get the idea. Basically an MDP is going to consist of these states of the world, these rewards, these actions, and the transitions.
The way we would represent this: this picture is a nice, I think intuitive, view of the whole thing, but when you're writing papers you need to represent this using more succinct notation, and when you're writing a program you really use a transition function. So we say T(s, a, s') is some probability. That is the transition function. Basically, it's just a table of s, a, s', and then the probability. Let me just write a couple of these out. If we are in state s1 and we take action a, then with probability 0.9 we end up in s2. If we are in s1 and we take action a, with probability 0.1 we end up in s4. That's s4 right there. If we are in s1 and we take action b, with probability 0.8 we end up in s5, and so on. You also have to add all the zeros to have your table complete. Oh, I messed this up, sorry: I'm putting the probabilities here and the states here. So here, in state s1, you take action a, and with probability 0 you end up in, say, s3, because you've got to have all the s primes there too. So a lot of these probabilities will be zero, and normally, yes, you would put the probability at the end. But you can see this is just one big, humongous table. In any application, even a toy one, the set of states really explodes. So there is a lot of programming you need to do if you're going to use these to actually represent something.
You need to find ways of limiting and reducing the set of states to something more manageable, something that's only in the millions, not in the trillions. You can imagine, for example, a two-dimensional grid with just one guy in it, and the grid is N by N. Well, right there that is N squared states. With two guys it is N squared times N squared, and with three guys it is N squared times N squared times N squared. So the more agents you have, the faster it explodes.
Okay, so these transition probabilities represent most of it. You also have the rewards: you have an R(s) function, which is just a number, and that gives you the reward. So between these two, the set of states, and the set of actions, you have defined your MDP. These are what define the whole MDP picture.
Okay, so I'm going to erase this, and let's talk about utility. I showed you this just now: 0 here, 0 here, 5 here, and he was getting 1 down here. These are the rewards, which are like utilities, but the question is: what is the utility of state s1? Let's label them 1, 2, 3, 4. So what is the utility of s4? This one seems pretty clear: for s4 it is 1, because you get a reward of 1. But if you had something where you take action a and keep getting rewarded 1, 1, 1, 1, that seems like more than 1, right? But let's say you don't have that. In that case it's just going to be 1. Then we're back here, and let's try to figure out the utility of state s1. Well, in s1 I'm getting 0, so is the utility of s1 zero? But if I am here and I just took action a, I could right away end up here and get 1, or I could take b, b and get 5, so that answer doesn't seem right.
The utility of s1 is not really 0, right? Yes, right now I'm getting 0, but I'm right next to the possibility of getting 1 or 5. So how do we set the utility of s1? Well, basically, we're saying that if you don't care about the future, if you're not going to take any more actions, your agent is going to die right here, then yes, the utility is 0, because you're never going to see those rewards. But if you do plan to take another action, and you do value future earnings somewhat, then this is not so bad. So what we're going to use is this gamma. We call it the discount factor of rewards. We're going to say that a reward now, if I get a reward R now, let's say 1, is worth 1 to me. But a reward R at the next time step (we're going to use discrete time), so at time t plus 1, is worth gamma times R. Gamma is going to be a number between 0 and 1, so it is less than 1. That is one time step in the future. Two time steps in the future it is going to be gamma squared times R, then gamma cubed, and so on. So, for example, let's say gamma is 0.9 and the reward is 5. If I get a reward of 5 now, that's worth 5 to me. If I get a reward of 5 next time, that's going to be 0.9 times 5, and the one after that is going to be 0.9 squared, which is 0.81, times 5. And I put these plus signs here because this is sort of telling me that I'm in a state right now where I get 5, and then I can take an action that's going to give me 5 after that, and then I can take an action that's going to give me 5 again after that. So picture something like this.
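To make the discounting concrete, here is a small Python sketch using the numbers from the example (gamma of 0.9, reward of 5). The function name `discounted` is just an illustrative choice, not something from the lecture.

```python
def discounted(reward, gamma, steps_ahead):
    """Value today of a reward received `steps_ahead` time steps from now."""
    return (gamma ** steps_ahead) * reward


gamma = 0.9
for k in range(3):
    # k = 0 is now, k = 1 is the next time step, k = 2 is the one after.
    print(k, round(discounted(5, gamma, k), 4))
```

The three values come out as 5, 0.9 times 5, and 0.81 times 5, matching the terms written on the board.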
In that case I know the utility for this particular state. Remember, this is utility now: I can say the utility for this state is 5, plus 0.9 times 5, plus 0.9 squared times 5, yada, yada, yada, up to infinity. The terms keep getting smaller and smaller, and because gamma is less than 1 this is a geometric series, so it converges, and that gives us the utility of this state. So that's how we go from reward values to utility.
Okay, so to come right around and finish this off: once we have a full definition of our problem (changing colors here), we can say that the utility of every state, which is different from the reward, is going to be the reward I get in that state, plus gamma times the max over actions, the best action I can take, and then I add up, over all states s prime, the probability that I go from s, given that I took action a, and end up in s prime, times the utility of that s prime. In symbols, U(s) = R(s) + gamma * max over a of the sum over s' of T(s, a, s') * U(s'). This is a recursive definition, and this is how we can define the utility of every state. So the utility of every state is the reward I get in that state, plus gamma times, and this is the tricky part, a term where I have to consider the actions. Let's say I'm in state s1. I can take either action a or b, so I'm going to take this max. If I take action a, I end up in s4 with probability 1, so T(s1, a, s4) is 1, and then there's the utility I get for state s4. Well, I don't know what that is right now, because all I have is the reward. But think about it.
There's no way out of s4. The utility of s is R(s) plus this term, but since there's no way out of s4, this whole term is going to be 0, so the utility of s4 is just the reward of s4, which is 1. So it's going to be 1 times 1, which is 1. So if I am in s1 and take a, this whole sum evaluates to 1. If I am in s1 and I take b, it's a similar thing. Let's go over here. Now that's a little bit harder, because now I have to calculate this utility, and in this case the utility of this guy depends on that one. You can see that it's going to be a little bit tricky to calculate, because we have this recursive definition of U(s), which itself depends on U(s). It's going to be a bit tricky to solve this and find all the U(s) values, and so we'll talk about that in the next lecture.
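The worked step for action a can be sketched in Python. This is only a sketch under assumptions: the dictionary layout and the names `T`, `R`, `U`, and `expected_utility` are illustrative, and only the part of the diagram stated in the lecture is encoded (s1 goes to s4 under a with probability 1, s1 has reward 0, s4 has reward 1 and no way out). It evaluates one side of the max for s1 and does not solve the recursive definition for states that depend on each other.

```python
# Transition table T[(s, a)] -> {s_prime: probability}.
# Entries left out stand for the zero-probability rows of the full table.
T = {("s1", "a"): {"s4": 1.0}}

# Reward function R(s); only the rewards quoted for s1 and s4 are used here.
R = {"s1": 0.0, "s4": 1.0}

# s4 has no way out, so its max-over-actions term is 0 and U(s4) = R(s4).
U = {"s4": R["s4"]}


def expected_utility(state, action, T, U):
    """Sum over s' of T(s, a, s') * U(s') for one (state, action) pair."""
    return sum(p * U[s_prime] for s_prime, p in T[(state, action)].items())


print(expected_utility("s1", "a", T, U))
```

The full utility of s1 would also need the same sum for action b, then the max of the two, multiplied by gamma and added to R(s1), which requires utilities that have not been computed yet.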