Transcript:

What’s up, guys? Welcome back to this series on reinforcement learning. In this video, we’re going to discuss Markov decision processes, or MDPs. This topic will lay the bedrock for our understanding of reinforcement learning, so let’s get to it. [MUSIC]

Markov decision processes give us a way to formalize sequential decision making. This formalization is the basis for problems that are solved with reinforcement learning. To kick things off, let’s discuss the components involved in an MDP.

In an MDP, we have a decision maker, called an agent, that interacts with the environment that it’s placed in. These interactions occur sequentially over time. At each time step, the agent will get some representation of the environment’s state, and given this representation, the agent selects an action to take. The environment is then transitioned into some new state, and the agent is given a reward as a consequence of its previous action.

So, to summarize, the components of an MDP include the environment, the agent, all the possible states of the environment, all the actions that the agent can take in the environment, and all the rewards that the agent can receive from taking actions in the environment.

This process of selecting an action from a given state, transitioning to a new state, and receiving a reward happens sequentially over and over again, which creates something called a trajectory that shows the sequence of states, actions, and rewards throughout the process.

It’s the agent’s goal to maximize the total amount of rewards that it receives from taking actions in given states of the environment. This means that the agent wants to maximize not just the immediate reward, but the cumulative rewards that it will receive over time.

Alright, let’s get a bit mathy and represent an MDP with mathematical notation.
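
Before moving on to the notation, here is a rough Python sketch of the agent-environment loop described so far. Everything in it is an assumption made for illustration: the two states, the two actions, the reward values, and the helper names `environment_step`, `random_agent`, and `run_episode` are hypothetical and do not come from the video.

```python
import random

# Hypothetical toy MDP for illustration only: none of these states,
# actions, or reward values come from the video.
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]


def environment_step(state, action):
    """Return (next_state, reward) for a state-action pair (made-up rules)."""
    if action == "work":
        return "high", 1.0 if state == "high" else 0.5
    return "low", 0.0


def random_agent(state, rng):
    """A placeholder agent that ignores the state and picks a random action."""
    return rng.choice(ACTIONS)


def run_episode(num_steps, seed=0):
    """Run the agent-environment loop and record the trajectory."""
    rng = random.Random(seed)
    state = "low"
    trajectory = []      # list of (state, action, reward) tuples
    total_reward = 0.0
    for _ in range(num_steps):
        action = random_agent(state, rng)                     # agent selects an action
        next_state, reward = environment_step(state, action)  # environment responds
        trajectory.append((state, action, reward))
        total_reward += reward                                # cumulative reward
        state = next_state
    return trajectory, total_reward


trajectory, total_reward = run_episode(num_steps=5)
for state, action, reward in trajectory:
    print(f"state={state:<4} action={action:<4} reward={reward}")
print("cumulative reward:", total_reward)
```

The random agent here makes no attempt to maximize anything. It only shows where a smarter action-selection rule would plug in, and how the trajectory and the cumulative reward build up step by step.
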
This notation will make things easier for us going forward, so we’re now going to repeat what we just casually discussed, but in a more formal and mathematically notated way.

In an MDP, we have a set of states S, a set of actions A, and a set of rewards R. We’ll assume that each of these sets has a finite number of elements.

At each time step t, the agent receives some representation of the environment’s state S_t. Based on this state, the agent selects an action A_t, and together this state and this action give us the state-action pair (S_t, A_t).

Time is then incremented to the next time step, t + 1, and the environment is transitioned into a new state, represented by S_{t+1}. At this time, the agent receives a numerical reward R_{t+1} for the action taken from the previous state. So generally, we can think of this process of receiving a reward as an arbitrary function that maps state-action pairs to rewards.

The trajectory representing the sequential process of selecting an action from a state, transitioning to a new state, and receiving a reward can be represented like this: S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, and so on.

This diagram nicely illustrates this entire idea. Let’s break this diagram down into steps.

Step 1: At time t, the environment is in state S_t.
Step 2: The agent observes the current state and selects action A_t.
Step 3: The environment transitions to state S_{t+1} and grants the agent reward R_{t+1}.

This process then starts over for the next time step, t + 1.

Now, since the set of states and the set of rewards are finite, the random variables R_t and S_t, which represent the reward and the state at time t, have well-defined probability distributions. In other words, all the possible values that can be assigned to R_t and S_t have some associated probability. These distributions depend on the preceding state and action that occurred in the previous time
step, t - 1. So, for example, suppose s' is a state within the set of all states S, and r is a reward within the set of all rewards R. Then there is some probability that the state at time t will be s' and that the reward at time t will be r. This probability is determined by the particular values of the preceding state and preceding action.

We have a bit more formal detail regarding transition probabilities on the corresponding blog for this video on deeplizard.com, so be sure to check that out.

Alright, we now have a formal way to model sequential decision making. How do you feel about Markov decision processes so far? Some of this may take a bit of time to sink in, but if you can understand the relationship between the agent and the environment and how they interact with each other over time, then you’re off to a great start. It’s a good idea to utilize the blog for this video to get more familiar with the mathematical notation, because we’ll be seeing it a lot in future videos. And while you’re at it, check out the deeplizard hivemind for exclusive perks and rewards.

Like we discussed earlier, MDPs are the bedrock for reinforcement learning, so make sure to get comfortable with what we covered here. Next time, we’ll build on the concept of cumulative rewards that we introduced earlier. Thanks for contributing to collective intelligence, and I’ll see ya in the next one. [MUSIC]
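
The transition probabilities mentioned in the video are commonly written as p(s', r | s, a): the probability that the state at time t is s' and the reward at time t is r, given that the preceding state was s and the preceding action was a. As a minimal sketch of that idea, the Python below stores such a table and samples from it. The states, actions, rewards, and probabilities are all invented for illustration, and the names `P`, `prob`, `check_distributions`, and `sample_step` are hypothetical helpers, not anything from the video or its blog post.

```python
import random

# Hypothetical transition table p(s', r | s, a); all numbers are invented
# for illustration and do not come from the video or its blog post.
P = {
    ("low", "wait"): {("low", 0.0): 1.0},
    ("low", "work"): {("high", 1.0): 0.7, ("low", -1.0): 0.3},
    ("high", "wait"): {("high", 0.5): 0.8, ("low", 0.0): 0.2},
    ("high", "work"): {("high", 2.0): 0.6, ("low", -1.0): 0.4},
}


def prob(next_state, reward, state, action):
    """Look up p(s', r | s, a); outcomes not listed have probability 0."""
    return P[(state, action)].get((next_state, reward), 0.0)


def check_distributions(table, tol=1e-9):
    """For each (s, a), the probabilities over (s', r) should sum to 1."""
    return all(abs(sum(dist.values()) - 1.0) < tol for dist in table.values())


def sample_step(state, action, rng):
    """Draw one (s', r) outcome according to p(., . | s, a)."""
    outcomes = list(P[(state, action)].keys())
    weights = list(P[(state, action)].values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


rng = random.Random(0)
print(check_distributions(P))             # every row is a valid distribution
print(prob("high", 1.0, "low", "work"))   # probability of one specific outcome
print(sample_step("high", "work", rng))   # one sampled (next_state, reward)
```

Because the sets of states and rewards are finite, each (s, a) pair gets a finite list of (s', r) outcomes whose probabilities sum to 1, which is what `check_distributions` verifies.
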