PyTorch Tutorial 03 – Gradient Calculation With Autograd

Python Engineer


Hi everybody, welcome to a new PyTorch tutorial. Today we learn about the autograd package in PyTorch and how we can calculate gradients with it. Gradients are essential for our model optimization, so this is a very important concept that we should understand. Luckily, PyTorch provides the autograd package, which can do all the computations for us; we just have to know how to use it. So let's start to see how we can calculate gradients in PyTorch.

First of all we import torch, of course, and now let's create a tensor: x = torch.randn(3). If we print x, we see a tensor with three random values. Now let's say later we want to calculate the gradient of some function with respect to x. Then we must specify the argument requires_grad=True; by default this is False. If we run this again, we see that PyTorch also tracks that the tensor requires the gradient, and now whenever we do operations with this tensor, PyTorch will create a so-called computational graph for us.

So let's say we do the operation x + 2 and store this in an output, so we say y = x + 2. This will create the computational graph, and it looks like this: for each operation we have a node with inputs and an output. Here the operation is a plus, so an addition; our inputs are x and 2, and the output is y. With this graph and a technique called backpropagation, we can then calculate the gradients. I will explain backpropagation in detail in the next video, but for now it's fine to just know how we can use it. First we do a forward pass, so we apply this operation, and in the forward pass we calculate the output y.
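These first steps might look like the following minimal sketch (variable names mirror the ones used in the video):

```python
import torch

# create a tensor of 3 random values and tell autograd to track operations on it
x = torch.randn(3, requires_grad=True)
print(x)  # printed with requires_grad=True

# any operation on x now extends the computational graph
y = x + 2
print(y)  # y carries a grad_fn attribute pointing back to the addition node
```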
Since we specified that it requires the gradient, PyTorch will automatically create and store a function for us, and this function is then used in the backpropagation to get the gradients. So y has an attribute grad_fn: this points to a gradient function, and in this case it's called AddBackward. With this function we can then calculate the gradients in the so-called backward pass, so this will calculate the gradient of y with respect to x. If we print y, we see exactly this grad_fn attribute, and here it is an AddBackward function, because our operation was a plus and we do the backpropagation later; that's why it's called AddBackward.

Let's do some more operations with our tensors. Let's say we have z = y * y * 2, for example. This tensor then also has the grad_fn attribute; here grad_fn is MulBackward, because our operation is a multiplication. And, for example, we can say z = z.mean(), so we apply a mean operation, and then our gradient function is MeanBackward. Now, when we want to calculate the gradients, the only thing that we must do is call z.backward(). This will calculate the gradient of z with respect to x, so x then has a .grad attribute where the gradients are stored. We can print this, and if we run it, we see the gradients in this tensor.
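A sketch of this chain of operations and the backward call; since z = mean(2 * (x + 2)^2), the gradient works out analytically to 4 * (x + 2) / 3 per element, which is what lands in x.grad:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x + 2              # grad_fn: AddBackward0
z = y * y * 2          # grad_fn: MulBackward0
z = z.mean()           # grad_fn: MeanBackward0, z is now a scalar

z.backward()           # backward pass: computes dz/dx
print(x.grad)          # the gradients, stored on x
```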
So this is all we have to do. Now let's have a look at what happens when we don't specify this argument. First of all, if we print our tensors, we see that they don't have the grad_fn attribute, and if we try to call the backward function, this will produce an error: it says the tensor does not require a grad and does not have a grad_fn. So remember that we must specify this argument, and then it will work.

One thing that we should also know: in the background, what this basically does is create a so-called vector-Jacobian product to get the gradients. I will not go into the mathematical details, but we should know that we have the Jacobian matrix with the partial derivatives, and then we multiply this with a gradient vector to get the final gradients that we are interested in. This is also called the chain rule, and I will explain it in more detail in the next video. But we should know that we actually must multiply with a vector. In this case, since our z is a scalar value, we don't have to use an argument here for our backward function; our z has only one value, so this is fine. But let's say we didn't apply the mean operation. Now our z has more than one value in it, it is of size 3, and when we try to call the backward function like this, it will produce an error: grad can be implicitly created only for scalar outputs. In this case we have to give it the gradient argument, so we have to create a vector of the same size. Let's say v = torch.tensor([0.1, 1.0, 0.001], dtype=torch.float32), and then we must pass this vector to our backward function, and now it will work again. So if we run this, it is okay, but we should know what happens in the background.
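The non-scalar case with the gradient vector might be sketched like this; calling backward(v) computes the vector-Jacobian product, so x.grad ends up as v * 4 * (x + 2) elementwise:

```python
import torch

x = torch.randn(3, requires_grad=True)
z = (x + 2) * (x + 2) * 2   # z is NOT a scalar: shape (3,)

# z.backward() alone would raise:
# "grad can be implicitly created only for scalar outputs"
v = torch.tensor([0.1, 1.0, 0.001], dtype=torch.float32)
z.backward(v)               # vector-Jacobian product J^T @ v
print(x.grad)
```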
In the background this is a vector-Jacobian product, and a lot of times the last operation is some operation that creates a scalar value, so it's okay to call backward like this without an argument; but if the output is not a scalar, we must give it the vector.

The next thing we should know is how we can prevent PyTorch from tracking the history and calculating this grad_fn attribute. For example, sometimes during our training loop, when we want to update our weights, this operation should not be part of the gradient computation. In one of the next tutorials I will give a concrete example of how we apply this autograd package, and then it will become clearer; but for now, we should know how we can prevent PyTorch from tracking the gradients, and we have three options for this. The first one is to call the requires_grad_() function and set this to False. The second option is to call x.detach(); this will create a new tensor that doesn't require the gradient. And the third option is to wrap the operations in a with statement: with torch.no_grad().

So let's try each of these. First we can say x.requires_grad_(False); whenever a function has a trailing underscore in PyTorch, it will modify our variable in place. If we print x now, we see that it doesn't have the requires_grad attribute anymore, so now this is False. This is the first option. The second option is to call x.detach(), so we say y = x.detach(). This will create a new tensor with the same values, but it doesn't require the gradient; here we see that our y has the same values but doesn't require the gradients. And the last option is to wrap it in a with torch.no_grad() statement.
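The three options might be sketched side by side like this:

```python
import torch

x = torch.randn(3, requires_grad=True)

# option 1: switch tracking off in place (trailing underscore = in-place op)
a = torch.randn(3, requires_grad=True)
a.requires_grad_(False)

# option 2: detach() returns a new tensor with the same values, untracked
b = x.detach()

# option 3: suspend tracking for a whole block of operations
with torch.no_grad():
    c = x + 2   # c has no grad_fn
```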
Inside torch.no_grad() we can then do some operations, for example y = x + 2, and if we print our y, we see that it doesn't have the grad_fn attribute. If we didn't use this and ran it like before, our y would have the gradient function. So these are the three ways we can stop PyTorch from creating these gradient functions and tracking the history in our computational graph.

Now one more very important thing that we should also know: whenever we call the backward function, the gradient for this tensor will be accumulated into the .grad attribute, so the values will be summed up. Here we must be very careful. Let's create some dummy training example where we have some weights, a tensor filled with ones, of size four, let's say, and they require the gradient, so requires_grad=True. Now let's say we have a training loop, for epoch in range(...), and first let's only do one iteration. Here we do model_output = (weights * 3).sum(); this is just a dummy operation which will simulate some model output. Then we want to calculate the gradients, so we say model_output.backward(), and now we have the gradients, so we can print weights.grad. Our gradients here are three, so the tensor is filled with threes. Now if we do another iteration, so if we have two iterations, the second backward call will again accumulate the values and write them into the .grad attribute, so now our grads have sixes in them. And if we do a third iteration, it has nines in it, so all the values are summed up, and our gradients are clearly incorrect. So before we do the next iteration and optimization step, we must empty the gradients.
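The accumulation behavior described above can be sketched as:

```python
import torch

weights = torch.ones(4, requires_grad=True)

for epoch in range(3):
    model_output = (weights * 3).sum()  # dummy operation simulating a model
    model_output.backward()
    print(weights.grad)  # fills with 3s, then 6s, then 9s: grads are summed up
```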
For this we call weights.grad.zero_(), and now if we run this, our gradients are correct again. This is one very important thing that we must note during our training steps. Later we will work with the PyTorch built-in optimizers. Let's say we have an optimizer from the torch.optim package, torch.optim.SGD for stochastic gradient descent, which gets our weights as parameters and some learning rate. With this optimizer we can do an optimization step, and then before we do the next iteration, we must call the optimizer.zero_grad() function, which will do exactly the same. We will talk about optimizers in some later tutorials.

For now, the things you should remember are: whenever we want to calculate the gradients, we must specify the requires_grad parameter and set it to True; then we can simply calculate the gradients by calling the backward function; and before we do the next iteration in our optimization steps, we must empty our gradients, so we must call the zero function again. We should also know how we can prevent some operations from being tracked in the computational graph. That's all I wanted to show you for now with the autograd package. I hope you liked it. Please subscribe to the channel and see you next time, bye.
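A small sketch of how this looks with the built-in optimizer (the learning rate of 0.01 here is just an illustrative choice, not from the video):

```python
import torch

weights = torch.ones(4, requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.01)

for epoch in range(3):
    model_output = (weights * 3).sum()
    model_output.backward()   # grad of the output w.r.t. weights is 3
    optimizer.step()          # weights -= lr * grad
    optimizer.zero_grad()     # empty the gradients before the next iteration

print(weights)  # each step subtracted 0.01 * 3, so after 3 steps: 1 - 0.09
```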
