Transcript:

Hi everybody, welcome to a new PyTorch tutorial. In this video I'm going to explain the famous backpropagation algorithm and how we can calculate gradients with it. I'll explain the necessary concepts of this technique, then walk you through a concrete example with some numbers, and at the end we will see how easy it is to apply backpropagation in PyTorch. So let's start.

The first concept we must know is the chain rule. Let's say we have two operations, or two functions: first we have the input x and apply a function a to get an output y, and then we use this output as the input for our second function b and get the final output c. Now we want to minimize c, so we want to know the derivative of c with respect to our x at the beginning, and we can do this using the so-called chain rule. For this we first compute the derivative of c with respect to y and multiply it by the derivative of y with respect to x, and that gives us the final derivative we want. So first we compute the derivative at the second function, the derivative of its output with respect to its input, then the same at the first function, and then we multiply them together to get the final gradient we are interested in. That's the chain rule.

The next concept is the so-called computational graph. For every operation we do with our tensors, PyTorch will create a graph for us, where at each node we apply one operation, or one function, with some inputs and get an output. In this example we use a multiplication operation, so we multiply x and y and get z.
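As a minimal numeric sketch of the chain rule, here is the two-function setup in plain Python (the concrete functions a and b are made up for illustration; the transcript leaves them abstract):

```python
# Two chained functions: y = a(x), then c = b(y)
def a(x):
    return x ** 2   # dy/dx = 2x

def b(y):
    return 3 * y    # dc/dy = 3

x = 2.0
y = a(x)            # 4.0
c = b(y)            # 12.0

# chain rule: dc/dx = dc/dy * dy/dx
dc_dy = 3.0
dy_dx = 2 * x
dc_dx = dc_dy * dy_dx   # 3 * 4 = 12
```

As a sanity check, composing the two functions gives c(x) = 3 * x**2, whose derivative 6x at x = 2 is indeed 12.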
At these nodes we can calculate so-called local gradients, and we can use them later in the chain rule to get the final gradient. Here we can compute two local gradients: the gradient of z with respect to x, which is simple since we know the function at the node — the derivative of x times y with respect to x is y — and, at the bottom, the derivative of x times y with respect to y, which is x. So local gradients are easy, because we know the function. And why do we want them? Because typically our graph has more operations, and at the very end we calculate a loss function that we want to minimize, so we have to calculate the gradient of this loss with respect to our parameter x at the beginning. Now suppose at this position we already know the derivative of the loss with respect to z. Then we can get the final gradient we want with the chain rule: the gradient of the loss with respect to x is the gradient of the loss with respect to z times our local gradient, the derivative of z with respect to x. This is how we get the final gradient.

The whole concept consists of three steps: first we do a forward pass where we apply all the functions and compute the loss; then at each node we calculate the local gradients; and then we do a so-called backward pass where we compute the gradient of the loss with respect to our weights, or parameters, using the chain rule. These are the three steps we're going to do, and now we look at a concrete example. Here we want to use linear regression, and if you don't know how this works, I highly recommend my machine-learning-from-scratch tutorial about linear regression; I will put the link in the description. Basically, we model our output with a linear combination of some weights and an input, so our y hat, or y predicted, is w times x, and then we formulate a loss function.
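The multiplication node with its local gradients can be sketched like this (the input values and the upstream gradient are made-up placeholders, since the transcript leaves them abstract here):

```python
# node: z = x * y
x, y = 3.0, 4.0
z = x * y

# local gradients, known directly from the node's own function
dz_dx = y   # derivative of x*y with respect to x
dz_dy = x   # derivative of x*y with respect to y

# suppose the derivative of the loss with respect to z is already known
# from further down the graph (hypothetical value for illustration)
dloss_dz = 0.5

# chain rule gives the gradient we actually want
dloss_dx = dloss_dz * dz_dx   # 0.5 * 4.0 = 2.0
dloss_dy = dloss_dz * dz_dy   # 0.5 * 3.0 = 1.5
```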
In this case, the loss is the squared error. Actually it should be the mean squared error, but for simplicity we just use the squared error; otherwise we would have another operation to compute the mean. So the loss is the difference of the predicted y minus the actual y, and then we square it. Now we want to minimize our loss, so we want to know the derivative of the loss with respect to our weight. And how do we get that? We apply our three steps. First we do a forward pass: we put in x and w, then we put in y, apply our functions, and get the loss. Then we calculate the local gradients at each node: here the gradient of the loss with respect to s (where s is y hat minus y), then the gradient of s with respect to y hat, and at this node the gradient of y hat with respect to w. Then we do a backward pass, so we start at the end: first we have the derivative of the loss with respect to s, then we use it together with the chain rule to get the derivative of the loss with respect to y hat, and then again we use this and the chain rule to get the final gradient of the loss with respect to w.

So let's do this with some concrete numbers. Let's say x and y are given: x is 1 and y is 2 at the beginning, so these are our training samples, and we initialize our weight, say w is 1. Then we do the forward pass: at the first node we multiply x and w, so we get y hat equals 1; at the next node we do a subtraction, y hat minus y, so 1 minus 2 equals minus 1; and at the very end we square our s, so (minus 1) squared, and our loss is 1. Now we calculate the local gradients. At the last node we have the gradient of the loss with respect to s, and this is simple because we know the function.
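The forward pass with these concrete numbers can be written out in plain Python:

```python
# training sample and initial weight, as in the walkthrough
x, y, w = 1.0, 2.0, 1.0

# forward pass through the graph
y_hat = w * x      # 1.0
s = y_hat - y      # 1 - 2 = -1
loss = s ** 2      # (-1)**2 = 1
```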
This is the gradient of s squared, which is just 2s. Then at the next node we have the gradient of s with respect to y hat, the derivative of y hat minus y with respect to y hat, which is just 1. And at the last node we have the derivative of y hat with respect to w, the derivative of w times x with respect to w, which is x. Also notice that we don't need to know the derivatives along the other edges in this graph: we don't need the derivative of s with respect to y, and we don't need the derivative of y hat with respect to x, because our x and y are fixed; we are only interested in the parameters we want to update.

Then we do the backward pass. First we use our two local gradients to compute the derivative of the loss with respect to y hat: with the chain rule this is 2s times 1, and s is minus 1, which we calculated above, so this is minus 2. Now we use this derivative and the last local gradient to get the final gradient, the gradient of the loss with respect to w, which is the gradient of the loss with respect to y hat times the gradient of y hat with respect to w: minus 2 times x, and x is 1, so the final gradient is minus 2. That is the final gradient we want to know, and that's how backpropagation works.

Now let's jump over to our code and verify that PyTorch gets these exact numbers. Remember, x is 1, y is 2, and w is 1, and our first gradient should be minus 2. So let's see how we do this in PyTorch. First of all we import torch, of course. Then we create our tensors: we say x equals torch.tensor with 1, then y equals torch.tensor with 2, and our initial weight is a tensor as well, with 1.0 to make it a float.
For our weight we are interested in the gradient, so we need to specify requires_grad=True. Then we do the forward pass and compute the loss: we simply say y_hat equals w times x, which is our function, and then loss equals y_hat minus the actual y, and we square this by raising it to the power of two. Now let's print our loss and see that it is 1 at the beginning. Next we want to do the backward pass, and PyTorch will compute the local gradients automatically for us and also run the backward pass automatically, so the only thing we have to call is loss.backward(). This is the whole gradient computation. Now our w has a .grad attribute and we can print it; this is the gradient after the first forward and backward pass, and remember, it should be minus 2. And here we see we have a tensor with minus 2, so this is working. The next steps would be, for example, to update our weights and then do the next forward and backward pass, and repeat this for a couple of iterations. Yeah, that's how backpropagation works, and also how easy it is to use it in PyTorch. I hope you enjoyed this tutorial. Please subscribe to the channel and see you next time, bye.
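Putting it all together, here is the code walked through above, with the manual chain-rule arithmetic from the worked example as a cross-check (the manual part just repeats the numbers computed by hand):

```python
import torch

# training sample and initial weight
x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True)  # we want the gradient w.r.t. w

# forward pass: compute the loss
y_hat = w * x
loss = (y_hat - y) ** 2
print(loss)  # tensor(1., grad_fn=...)

# backward pass: PyTorch computes the local gradients
# and applies the chain rule for us
loss.backward()
print(w.grad)  # tensor(-2.)

# manual cross-check with the numbers from the walkthrough
s = 1.0 - 2.0                              # s = y_hat - y = -1
dloss_ds = 2 * s                           # 2s = -2
ds_dyhat = 1.0                             # d(y_hat - y)/d(y_hat)
dyhat_dw = 1.0                             # d(w*x)/dw = x = 1
dloss_dw = dloss_ds * ds_dyhat * dyhat_dw  # -2, matches w.grad
```

From here, a training loop would update w (e.g. w = w - lr * w.grad, under torch.no_grad()), zero the gradient, and repeat the forward and backward passes for a number of iterations.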