
PyTorch Reinforcement Learning | Teach AI To Play Snake – Reinforcement Learning Tutorial With PyTorch And Pygame (Part 1)

Python Engineer



Transcript:

Hey guys, today I have a very exciting project for you. We are going to build an AI that teaches itself how to play Snake, and we will build everything from scratch: we start by creating the game with Pygame, and then we build an agent and a deep learning algorithm with PyTorch. I will also teach you the basics of reinforcement learning that we need to understand how all of this works, so I think this is going to be pretty cool.

Now, before we start, let me show you the final project. I start the script by saying python agent.py. This starts training our agent; here we see our game, and here I also plot the scores and the average score. Let me also start a stopwatch so that you can see that all of this is happening live. At this point our snake knows absolutely nothing about the game. It is only aware of the environment and makes more or less random moves, but with each move, and especially with each game, it learns more and more, figures out how to play, and should get better and better. In the first few games you won't see a lot of improvement, but don't worry, that's absolutely normal. I can tell you that it takes around 80 to 100 games until our AI has a good game strategy, and this takes around 10 minutes. Also, you don't need a GPU for this; all of the training can happen on the CPU, and that's totally fine. Okay, so let me speed this up a little bit.

All right, now about 10 minutes have passed and we are at about game 90, I guess, and we can clearly see that our snake knows what it should do: it's more or less going straight for the food and tries not to hit the boundaries. It's not perfect at this point, but we can see that it's getting better and better, and the average score is also increasing. The best score so far is 25, and to be honest, for me this is super exciting. If you consider that at the beginning our snake didn't know anything about the game, and now, with a little bit of math behind the scenes, it's clearly following a strategy, this is just super cool, don't you think? All right, let me speed this up a little bit more. After 12 minutes our snake is getting better and better, so I think you can clearly see that our algorithm works. Now let me stop this, and then let's start with the theory.

I will split the series into four parts. In this first video we learn a little bit about the theory of reinforcement learning. In the second part we implement the actual game, also called the environment here, with Pygame. Then we implement the agent (I will tell you what this means in a second), and in the last part we implement the actual model with PyTorch.

So let's start with a little bit of theory about reinforcement learning. This is the definition from Wikipedia: reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. This might sound a little bit complicated, so in other words we can also say that reinforcement learning is teaching a software agent how to behave in an environment by telling it how good it's doing. What we should remember here is that we have an agent, so that's basically our computer player.
Then we have an environment, which is our game in this case, and then we give the agent a reward; with this we tell it how good it's doing, and based on that reward it should try to find the best next action. So yeah, that's reinforcement learning. To train the agent there are a lot of different approaches, and not all of them involve deep learning, but in our case we use deep learning, and this is also called deep Q-learning. This approach extends reinforcement learning by using a deep neural network to predict the actions, and that's what we're going to use in this tutorial.

All right, so let me show you a rough overview of how I organized the code. As I said, we have four parts: in the next part we implement the game with Pygame, then we implement the agent, and then we implement the model with PyTorch. Our game has to be designed such that we have a game loop, and with each iteration we do a play step that takes an action, moves the snake, and after the move returns the current reward, whether we are game over or not, and the current score. Then we have the agent, and the agent basically puts everything together; that's why it must know about the game and also about the model, so we store both of them in our agent, and then we implement the training loop. This is roughly what we have to do: based on the game we calculate a state, and based on the state we calculate the next action, which involves calling model.predict. Then with this new action we do the next play step, and as I said, we get a reward, the game over state and the score. Now with this information we calculate a new state, and then we remember all of this, so we store the new state, the old state, the game over state and the score, and with this we then train our model. For the model, I call this Linear_QNet. It's not too complicated, just a feed-forward neural net with a few linear layers, and for training it needs the new state and the old state. Then we can train the model and call model.predict, which gets us the next action. So yeah, this is a rough overview of how the code should look, and now let's talk about some of those variables in more detail, for example the action, the state and the reward.

Let's start with the reward; that's pretty easy. Whenever our snake eats a food, we give it a +10 reward. When we are game over, so when we die, we get -10, and for everything else we just stay at 0. That's pretty simple.
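To make this code organization and the reward scheme a bit more concrete, here is a minimal skeleton of the three building blocks just described. The class and method names (SnakeGameAI, Agent, play_step, get_state, get_action, remember, train) follow the description above but are assumptions at this point; the real implementations come in the later parts of the series.

```python
# Skeleton of the three building blocks described above (assumed names, outline only;
# the real implementations follow in the later parts of this series).

class SnakeGameAI:
    """The environment (Part 2). Rewards: +10 for food, -10 for game over, 0 otherwise."""

    def play_step(self, action):
        # move the snake one step for the given action and report the outcome
        reward, game_over, score = 0, False, 0
        return reward, game_over, score

    def reset(self):
        ...  # start a new game


class Agent:
    """Puts everything together (Part 3): knows about the game state and the model."""

    def get_state(self, game): ...        # the 11 boolean values described below
    def get_action(self, state): ...      # [straight, right turn, left turn]
    def remember(self, *transition): ...  # store (state, action, reward, next_state, done)
    def train(self, *transition): ...     # update the model (Part 4)


def train_loop():
    game, agent = SnakeGameAI(), Agent()
    while True:
        state_old = agent.get_state(game)
        action = agent.get_action(state_old)
        reward, game_over, score = game.play_step(action)   # one play step
        state_new = agent.get_state(game)

        agent.remember(state_old, action, reward, state_new, game_over)
        agent.train(state_old, action, reward, state_new, game_over)

        if game_over:
            game.reset()
```

The important point here is the interface: the game only needs to expose play_step(action) returning the reward, the game over flag and the score, and the agent owns everything else.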
Then we have the action. The action determines our next move. You might think that we need four different actions, so left, right, up and down, but if we design it like this then, for example, if we are going right and we take the action left, we immediately die; that's basically a 180 degree turn, and we don't allow that. A better approach is to use only three different values, which depend on the current direction. [1, 0, 0] means we stay in the current direction, so we go straight: if we are going right, we keep going right, if we are going left, we keep going left, and so on. [0, 1, 0] means we do a right turn, and again this depends on the current direction: if we go right and do a right turn, we go down; if we then go down and do a right turn again, we go left; and with another right turn we would go up. The left turn, [0, 0, 1], is the other way around: if we go left and do a left turn, we go down, and so on. With this approach we cannot do a 180 degree turn, and we only have to predict three different actions, which makes it a little bit easier for our model.

Now we have the reward and the action, so we also need to calculate the state. The state is the information about the game that our snake knows; it needs to know about the environment, and in this case our state has 11 values: whether the danger is straight ahead, whether the danger is to the right, or whether the danger is to the left; then the current direction, so direction left, right, up and down; and then whether the food is left, right, up or down. All of these are boolean values. Let me show you an actual example. In this case, if we are going right and our food is over here, then danger straight, danger right and danger left are all false. If, for example, our snake were over here at the edge and still going right, then danger straight would be a 1, so this again depends on the current direction. If we were moving up at this corner, then danger right would be a 1. For the direction values, exactly one of them is 1 and the rest are always 0; in this case direction right is set to 1. And in our example the food is to the right of the snake and also below the snake, so food right is 1 and food down is 1.

All right, so now with the state and the action we can design our model. This is just a feed-forward neural net with an input layer, a hidden layer and an output layer. For the input it gets the state; as I said, we have 11 different numbers in our state, 11 boolean values that are 0 or 1, so we need the size 11 at the beginning. Then we can choose a hidden size, and for the output we need three values, because with them we predict the action. These can be raw numbers, they don't need to be probabilities, and then we simply choose the maximum. For example, if that gives us [1, 0, 0] and we go back, we see this is the action straight, so we keep the current direction. So yeah, that's how our model looks.
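As a sketch of how the three-value action and the 11-value state could be computed: the example below assumes the game exposes the snake's head position, current direction, food position and an is_collision(point) helper, and that one grid block is 20 pixels; all of these names and values are illustrative, not the final code.

```python
import numpy as np

# Illustrative sketch of the action and state described above. It assumes the game
# exposes head position, current direction, food position and an is_collision(point)
# helper, and that one grid block is 20 pixels; all of these are assumptions here.

CLOCKWISE = ["right", "down", "left", "up"]

def next_direction(current, action):
    # [1,0,0] = straight, [0,1,0] = right turn, [0,0,1] = left turn
    i = CLOCKWISE.index(current)
    if action == [1, 0, 0]:
        return CLOCKWISE[i]                  # keep the current direction
    if action == [0, 1, 0]:
        return CLOCKWISE[(i + 1) % 4]        # right turn: next clockwise direction
    return CLOCKWISE[(i - 1) % 4]            # left turn: previous clockwise direction

def get_state(game):
    head_x, head_y = game.head
    dir_l, dir_r = game.direction == "left", game.direction == "right"
    dir_u, dir_d = game.direction == "up", game.direction == "down"

    # the four points directly around the head (pygame's y axis grows downwards)
    point_l, point_r = (head_x - 20, head_y), (head_x + 20, head_y)
    point_u, point_d = (head_x, head_y - 20), (head_x, head_y + 20)

    state = [
        # danger straight
        (dir_r and game.is_collision(point_r)) or (dir_l and game.is_collision(point_l))
        or (dir_u and game.is_collision(point_u)) or (dir_d and game.is_collision(point_d)),
        # danger right (relative to the current direction)
        (dir_u and game.is_collision(point_r)) or (dir_d and game.is_collision(point_l))
        or (dir_l and game.is_collision(point_u)) or (dir_r and game.is_collision(point_d)),
        # danger left (relative to the current direction)
        (dir_d and game.is_collision(point_r)) or (dir_u and game.is_collision(point_l))
        or (dir_r and game.is_collision(point_u)) or (dir_l and game.is_collision(point_d)),
        # current direction, exactly one of these is 1
        dir_l, dir_r, dir_u, dir_d,
        # food location relative to the head
        game.food[0] < head_x,   # food left
        game.food[0] > head_x,   # food right
        game.food[1] < head_y,   # food up
        game.food[1] > head_y,   # food down
    ]
    return np.array(state, dtype=int)   # 11 values, each 0 or 1
```

Exactly one of the four direction flags is 1, and the three danger flags are always interpreted relative to that direction, which is what makes the three-action scheme work.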
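The feed-forward network itself could look roughly like this in PyTorch; the class name Linear_QNet comes from the video, while the hidden size of 256 is just an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rough PyTorch sketch of the feed-forward network described above:
# 11 state values in, raw scores for the 3 possible actions out.
# The hidden size (256) is an illustrative choice, not prescribed by the video.

class Linear_QNet(nn.Module):
    def __init__(self, input_size=11, hidden_size=256, output_size=3):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.linear1(x))
        return self.linear2(x)   # raw numbers, no softmax needed

# Picking the action: take the index of the largest output and one-hot encode it.
model = Linear_QNet()
state = torch.zeros(11)                        # example state (11 zeros/ones)
prediction = model(state)                      # 3 raw scores
action = [0, 0, 0]
action[torch.argmax(prediction).item()] = 1    # e.g. [1, 0, 0] -> go straight
```

Because we only compare the three outputs against each other and take the maximum, raw scores are enough and no softmax is needed.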
And of course, now we have to train the model, so for this let's talk a little bit about deep Q-learning. The Q stands for the Q value, which represents the quality of the action; this is what we want to improve, so each action should improve the quality for the snake. We start by initializing the Q value, which in this case means we initialize our model with some random parameters. Then we choose an action by calling model.predict(state), and sometimes we also choose a random move instead. We do this especially at the beginning, when we don't know a lot about the game yet; later we don't want to do random moves anymore and only call model.predict. This is the so-called trade-off between exploration and exploitation, and it will get clearer later when we do the actual coding. Then with this action we perform the next move, we measure the reward, and with this information we can update our Q value and train the model, and then we repeat these steps, so this is an iterative training loop.

Now, to train the model, as always we need some kind of loss function that we want to optimize, or minimize. For the loss function we have to look at a little bit of math, and for this I want to present the so-called Bellman equation. This might look scary, but don't be scared, I will explain everything, and it's actually not that difficult once we understand it and then code it later. What we want to do here is update the Q value, as I said. According to the Bellman equation, the new Q value is calculated like this: we take the current Q value, plus the learning rate times the sum of the reward for taking that action at that state and a gamma parameter, which is called the discount rate (don't worry about this, I will show it later in the code), times the maximum expected future reward given the new state and all possible actions at that new state. So yeah, this looks scary, but I will simplify it for you, and then it's actually not that difficult.

The old Q value is model.predict with state0: if we go back to the overview, the first time we get the state from the game, this is our state0, and then after we take the play step we again calculate the next state, which is our state1. With this, our first Q is just model.predict with the old state, and the new Q is the reward plus our gamma value times the maximum value of model.predict with state1. Then with these two values, our loss is simply (Q_new - Q) squared, and this is nothing else than the mean squared error, a very simple loss that we should already know about, and this is what we use in our optimization.

So yeah, that's what we are going to use. We have to implement all of these three classes, and in the next video we start by implementing the game. I hope you enjoyed this first introduction, and I'll see you in the next video. Bye.
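Written out, the update described here is the usual Q-learning rule Q_new(s, a) = Q(s, a) + alpha * (R(s, a) + gamma * max_a' Q(s', a') - Q(s, a)), and the simplified version from the video boils down to a target of reward plus gamma times the best predicted value of the next state. A single training step on one transition could then look roughly like this; the model is the Linear_QNet sketched above, and gamma and the learning rate are illustrative values, not prescribed here.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Rough sketch of one training step for the simplified update described above:
#   Q     = model(state0)                        (current prediction)
#   Q_new = reward + gamma * max(model(state1))  (target, only if not game over)
#   loss  = (Q_new - Q)^2                        (mean squared error)
# gamma and the learning rate are illustrative values.

gamma = 0.9
model = Linear_QNet()                       # the network sketched above
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

def train_step(state0, action, reward, state1, game_over):
    state0 = torch.tensor(state0, dtype=torch.float)
    state1 = torch.tensor(state1, dtype=torch.float)

    pred = model(state0)                    # Q values for all 3 actions
    target = pred.clone().detach()

    # new Q value for the action that was actually taken
    q_new = reward
    if not game_over:
        with torch.no_grad():
            q_new = reward + gamma * torch.max(model(state1)).item()
    target[torch.argmax(torch.tensor(action)).item()] = q_new

    optimizer.zero_grad()
    loss = criterion(pred, target)          # mean squared error (Q_new - Q)^2
    loss.backward()
    optimizer.step()
```

The exploration/exploitation trade-off mentioned above would typically be handled in get_action: with some probability the agent picks a random move instead of the maximum of the model output, and that probability is reduced as the number of games played grows.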
