Hi, everyone. Previously, on our YouTube channel, you should have watched my two videos on unsupervised and supervised learning with MATLAB. We also recently added a new video on deep learning with MATLAB. If you haven’t checked them yet, please check the links that you find below in the description of this video. In this short video, I’m going to discuss an easy example of reinforcement learning. Reinforcement learning is what we have not yet talked about in my seminars on machine learning, and it is another approach to solving problems that typically come from control theory. To explain all the things that I want to tell you today, let’s dive into the code. I’m Faden from the University of Pisa, and this is my short video. In this short live script, you will have all the information needed to understand the code that you will see throughout this tutorial. So, reinforcement learning is a method in between supervised and unsupervised learning, which has the aim of training, for example, the controller of a system, or an algorithm to play games such as chess, Go, and Atari games. It is based on an agent which performs some action in the environment as a function of its current state, and it receives a feedback signal, called the reward, to evaluate the state that it is going to reach. The aim of our agent is to learn to perform the actions that maximize what is called the total reward, which is the collection of all the feedback signals obtained by the agent during the exploration of the environment. The idea is that we have some episodes, which are time slots in which we run our agent to explore the environment. For each episode, the goal of our agent is to maximize the total future reward. To do so, as we have said, we have to explore the environment and build some knowledge about the consequences that each action will produce in each particular state. This can be done by learning what is called a quality value,
Q, which represents the amount of knowledge that we are collecting, during the exploration of the environment, about a given action. It is clear that the best action that we can perform in the state s_t is always the one that maximizes the quality value, but this leads to a deterministic behavior in which, for every state, we already know that we have to perform that specific action. We don’t want this, especially if our agent is not trained yet. So, to explore, we need some probabilistic actions; in the RL framework, we say that we need a probabilistic policy. The policy is the rule used by the agent to choose an action a from a state s. The epsilon-greedy policy states that with probability epsilon we choose a random action; instead, with probability one minus epsilon, we just follow the best action that we can perform, which is also called the greedy policy. One way to learn the quality value that we need is to use a temporal-difference method based on the Bellman equation. So remember, we have three parameters. One is the learning rate alpha. Then we have a parameter gamma to weight the future rewards. Then we have a parameter epsilon that is used to balance exploration and exploitation. In particular, this last concept is called the trade-off between exploration and exploitation, which is found in reinforcement learning but also in other computer science fields, like evolutionary algorithms. In this short video, we are going to discuss the realization from scratch of an algorithm to solve the grid world, which is an environment in which an agent has to reach a certain goal, traveling across a square environment, and has to avoid an enemy. The code provided in this live script takes inspiration from a YouTube video on reinforcement learning with Python, which is linked in the live script and also in the description of the video.
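To make the epsilon-greedy policy described above concrete, here is a minimal Python sketch. It is not the MATLAB code from the live script; the function name `epsilon_greedy` and its arguments are hypothetical, and the Q-values of the current state are assumed to be given as a plain list with one entry per action.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q-values of the current state.

    With probability epsilon a uniformly random action is returned
    (exploration); otherwise the action with the highest Q-value is
    returned (exploitation, i.e. the greedy policy).
    """
    if rng.random() < epsilon:
        # Explore: ignore the Q-values and pick any action.
        return rng.randrange(len(q_values))
    # Exploit: first index holding the maximum Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon = 0` the function is purely greedy, and with `epsilon = 1` it is purely random, which are the two extremes of the exploration and exploitation trade-off.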
The code starts with the setting of some constant parameters, like the number of episodes and some values used to set penalties and rewards. In particular, we will set a penalty every time we make a movement, to force the agent to speed up, and we will also assign a penalty to the agent every time it hits the enemy. Instead, we will give it a reward every time the agent is able to reach the goal. Among the other parameters, it is important to set the learning rate and the discount factor, but also, even though it is not shown here, the epsilon value, which balances exploration and exploitation during the training procedure, and the decay of the epsilon value. Typically, in reinforcement learning algorithms, what happens is that we start with a larger value of epsilon, meaning that at the beginning of the training we tend to explore more and take more random actions. Then, going toward the end of the training process, after having collected enough knowledge, we start to exploit our knowledge more and perform fewer random actions. We are going to develop a Q-learning algorithm. Q-learning requires the creation, and the frequent update, of a table called the Q-table, which has as its entries each possible combination of the state and the action that we can perform in that specific state. For each of these combinations, we have a value Q, which reflects the quality of that specific action in that specific state. In this code, you can either start the training with a random Q-table, or you can load an existing one and play with the agent. Before starting, we have to build some auxiliary functions used to create our agent, create our environment, and let our agent move inside the environment. All of these auxiliary functions are at the end of the code, in a section called auxiliary functions, and we are not going to discuss them during this short video.
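The epsilon decay and the random Q-table mentioned above can be sketched in Python as follows. This is an illustration only, not the MATLAB live script: the function names, the multiplicative form of the decay, and all numeric values (including the range of the random initial Q-values) are assumptions.

```python
import random


def epsilon_schedule(start, decay, n_episodes):
    """Return the epsilon used in each episode under multiplicative decay."""
    eps = start
    values = []
    for _ in range(n_episodes):
        values.append(eps)
        eps *= decay  # assumed: epsilon is scaled once per episode
    return values


def random_q_table(n_states, n_actions, low=-5.0, high=0.0, seed=0):
    """Return a Q-table (one row per state, one column per action)
    filled with random values in [low, high]; the range is hypothetical."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(n_actions)]
            for _ in range(n_states)]


# Example with hypothetical settings: epsilon shrinks from 0.9 toward 0.
schedule = epsilon_schedule(start=0.9, decay=0.998, n_episodes=1000)
```

Early entries of `schedule` are close to the starting value, so the agent mostly explores; later entries are much smaller, so the agent mostly exploits what it has learned.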
You can check the auxiliary functions if you want, and you can play with them. Taking inspiration from the original Python code, we call every entity that we can put in our environment, such as the player, the goal, and the failure, a blob. So, essentially, we will have a function called initialize blob, which is used to create one of these entities inside our environment. Our environment, since we are in a grid world, is just a kind of occupancy grid, which is a matrix that can take four kinds of values: zero if no entity is present at that point of the grid, or one, two, or three, depending on whether the player, the goal, or the failure is at that point. In this brief section of the code, we initialize our environment by creating a square matrix of zeros, and then we start adding the blobs, which are the player first, then an enemy, and then one or two goals. After creating our blobs in the environment, as I told you, we have the possibility to display our grid world. Since this is a matrix, we can choose a color value for each kind of blob that we have in the environment, so we can customize a color map to display the matrix in such a way that we have a nice visualization of our environment. This is done in the auxiliary function display grid, and this is the outcome: we spawned two goal blobs here, and we have also spawned a failure blob and a player blob. We have also defined a function that allows us to move in the environment. This is very useful because we can use this function every time, just calling that specific action for that specific blob and updating the environment according to the action that we take.
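The occupancy grid and the move function described above can be sketched in Python like this. It is a simplified stand-in for the MATLAB auxiliary functions, which are not shown in the video: the names `make_grid` and `move`, the four-action set, and the choice to clamp moves at the grid borders are all assumptions.

```python
# Cell codes of the occupancy grid, as described in the tutorial.
EMPTY, PLAYER, GOAL, FAILURE = 0, 1, 2, 3

# Hypothetical action set: (row offset, column offset) for each move.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def make_grid(size, player, goal, failure):
    """Build a size x size matrix of zeros and mark the three blobs."""
    grid = [[EMPTY] * size for _ in range(size)]
    # The player is written last so it stays visible if cells overlap.
    for code, (row, col) in ((GOAL, goal), (FAILURE, failure), (PLAYER, player)):
        grid[row][col] = code
    return grid


def move(position, action, size):
    """Return the new (row, col) of a blob after an action.

    Assumption: a move that would leave the grid is clamped to the border.
    """
    d_row, d_col = MOVES[action]
    row = min(max(position[0] + d_row, 0), size - 1)
    col = min(max(position[1] + d_col, 0), size - 1)
    return (row, col)
```

Because `move` only needs a position and an action, the same function works for any blob, which matches the remark that the move function can be called for the player, the goal, or the failure.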
I’d like to mention that this function can be used for any kind of blob: for the player blob, for the failure blob, and for the goal blob. However, during the training procedure in this code, we will assume that in each episode the failure blob and the goal blob are kept fixed, while just the player blob moves, trying to find the best path to reach the goal blob. These functions allow you, by running this section many times, to play and move the player blob in the environment. And if you just replace player with food or with enemy, you can also move the other blobs in the environment. Here is a brief example. We try to move, for instance, this one to the left, and this is the result. We try to move this one up, and this is the result. You can also run it again to go up, or instead change the control, by just running this section, and this is the result. You can keep moving up, moving down, left, and so on. Then, this is another auxiliary function, used to clean the world, that is, to clean the environment. Now we move to the main code, which is the implementation of a Q-learning algorithm to solve our grid world environment. As I told you, this is based on the implementation of the Bellman equation to estimate and update the values of Q, our quality function, in order to build some knowledge of the environment and help our agent find the best route, the best path, from where it spawns toward the goal, while also avoiding the failure blob. The code gives you the possibility to run an existing Q-table or to start a new random one, maybe because you want to train your agent. You can, of course, choose whether to train your agent by setting train to true or false here at the beginning. Here we initially state that we want a new Q-table and that we want to train our agent. Then, for each episode, we will spawn three blobs.
We will spawn the player, the goal, and the failure, and we will define a maximum length for our episode, meaning that we don’t want an episode to last too long. So 200 is the number of steps allowed inside each episode. Then, at each step of the episode, we get an observation, that is, we work out which state our agent is in. In this code, the state of the agent is encoded as the relative difference between the player and the goal and between the player and the failure. Then we choose an action according to the epsilon-greedy policy. If a random value that we draw is greater than epsilon, we just follow the greedy policy, which means we check which entry of the Q-table for that specific state has the maximum value and select the action accordingly; otherwise, we just choose a random action. Then we take this action by calling the auxiliary function action, and then we have the reward assignment. As I told you at the beginning, the reward assignment depends on what happens. If we are just doing a normal step, so we are moving the player into an empty space in the grid, we just assign a small penalty. We also assign a penalty if the player reaches the failure point in the grid, and we assign a reward if the player is able to reach the goal.
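The state encoding and the reward assignment described above can be sketched in Python as follows. The function names and the three numeric constants are hypothetical, since the transcript does not state the values used in the MATLAB live script; positions are assumed to be (row, column) tuples.

```python
# Hypothetical values: a small penalty per step, a large penalty for the
# failure blob, and a positive reward for the goal.
MOVE_PENALTY = -1
FAILURE_PENALTY = -300
GOAL_REWARD = 25


def observe(player, goal, failure):
    """Encode the state as the player's offsets from the goal and the failure."""
    return (player[0] - goal[0], player[1] - goal[1],
            player[0] - failure[0], player[1] - failure[1])


def reward_for(player, goal, failure):
    """Return the reward for the cell the player has just moved into."""
    if player == failure:
        return FAILURE_PENALTY
    if player == goal:
        return GOAL_REWARD
    return MOVE_PENALTY  # ordinary step into an empty cell
```

Encoding the state as relative offsets means the Q-table does not depend on the absolute positions of the blobs, only on where the player is with respect to the goal and the failure.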
Then, after the assignment of the reward, we have a new observation: performing this action in the state s leads the agent to a new state, s prime. For this new state, we have to compute the maximum future value of Q, based on the new observation. Together with the current value of Q, or in a certain sense the previous value of Q, we can update our Q value in the Q-table according to the Bellman equation. So this is just the implementation of the Bellman equation. It is used whenever we don’t reach the goal. Whenever we reach the goal, we just set the value of Q to the food reward. This is essentially everything. The episode keeps going unless we reach the goal point or the enemy point; in these cases, the episode stops, we collect the reward of the episode, and we scale the value of epsilon with the epsilon decay parameter. And then we clean the world. So, at each episode, the things we have to do are to start a new environment by spawning the blobs and, after having spawned the blobs, to set the maximum length that the episode can last. During the episode, we have a loop of things that happen: checking the state where we are, choosing an action according to the epsilon-greedy policy, performing the action, assigning the reward, checking the new state, evaluating the maximum Q value in this new state, evaluating the Q value in the previous state, and then using the Bellman equation to update the value of Q for the previous state and the action taken. And this keeps going. We can visualize this by asking to plot an episode during the training. It is better not to do this at every episode; otherwise, you will end up with a lot of plots, and this is also very time-consuming.
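The update step described above can be sketched in Python with the standard Q-learning rule, using the learning rate alpha and the discount factor gamma. This is an illustration rather than the MATLAB code from the live script: the function name `q_update`, the dictionary-based Q-table, and the `reached_goal` flag are assumptions.

```python
from collections import defaultdict

N_ACTIONS = 4  # hypothetical number of actions


def q_update(q, state, action, reward, new_state, alpha, gamma, reached_goal):
    """Update q[state][action] in place and return the new value.

    If the goal was reached, Q is set directly to the goal reward, as the
    tutorial describes. Otherwise the standard Q-learning update is used:
    Q <- (1 - alpha) * Q + alpha * (reward + gamma * max_a' Q(s', a')).
    """
    if reached_goal:
        q[state][action] = reward
    else:
        max_future_q = max(q[new_state])   # best value reachable from s'
        current_q = q[state][action]       # value before the update
        q[state][action] = ((1 - alpha) * current_q
                            + alpha * (reward + gamma * max_future_q))
    return q[state][action]


# Example Q-table: unseen states start with all-zero Q-values (an assumption;
# the tutorial starts from a random table instead).
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)
```

Note that the entry being updated belongs to the previous state and the action just taken; the new state only contributes through its maximum Q-value.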
Another interesting thing that can be visualized is the episode reward, with this plot, along the training procedure. So we are now going to run our code and see what happens at the end of the training. These are some examples of episodes, and you can see that the green blob is overlapped with the blue one, meaning that our agent was able to find the correct path in these examples. You can also display the average reward over time to see if the agent is really learning. Observe that here we are displaying the moving average of the episode reward, because the episode reward is a very noisy signal. This is due to the fact that the algorithm, during the exploration, especially during the first steps, tends to achieve a very random value of the total reward. You also have the possibility, in the last section of the code, to play with the trained agent. This is everything from my side. You will find the code in a link in the description below, and again, check our videos on YouTube. Thank you very much, guys.