Transcript:

So in today’s video, I wanted to take a deep dive into how PyTorch’s autograd system works. In this example, we create a tensor with the value 2 and assign it to the variable a, create a tensor with the value 3 and assign it to the variable b, and then multiply a and b to get c. In these diagrams, the rectangles are tensors and the internal values are the attributes of that tensor. We have the data attribute, which holds the data of the tensor. We have grad, which will hold the gradient value once it’s calculated. We have grad_fn, the gradient function, which points to a node in the backward graph; we’ll get more into this later. We have is_leaf, which says whether this tensor is a leaf of a graph, and we have requires_grad.

If requires_grad is False for all tensors being input into an operation, the output will also be a tensor that has requires_grad equal to False, and no backward graph will be created. The output tensor will also be a leaf. However, if we set requires_grad=True when creating the tensor for a, then when we pass this tensor into any operation, the output tensor will be part of a graph. It will have requires_grad equal to True and is_leaf equal to False, since it is no longer a leaf of the graph, and it’ll also have a grad_fn, which points to a node in the backward graph.

So this is the part of the operation that we see, but behind the scenes a backward graph is being created, and here we can see what that backward graph is. When we call the multiply function, it has access to a context variable, and it can store any values it needs for the backward pass in that context variable.
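Before following the context variable into the backward pass, the tensor attributes described above can be checked with a small sketch. The variable names `c_plain` and `c_tracked` are made up for illustration; the values 2 and 3 come from the example in the video.

```python
import torch

# Case 1: no input requires grad, so no backward graph is built.
a = torch.tensor(2.0)
b = torch.tensor(3.0)
c_plain = a * b
print(c_plain.data, c_plain.grad, c_plain.grad_fn)
print(c_plain.is_leaf, c_plain.requires_grad)

# Case 2: a requires grad, so the output becomes part of a graph.
a = torch.tensor(2.0, requires_grad=True)
c_tracked = a * b
print(c_tracked.grad_fn)        # a node in the backward graph
print(c_tracked.is_leaf, c_tracked.requires_grad)
```

In the first case the output is itself a leaf with no `grad_fn`; in the second it is a non-leaf whose `grad_fn` refers to the multiplication's backward node.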
This context variable will be passed to the MulBackward operation in the backward pass. Now, to be clear, the save_for_backward method and the saved_tensors property are the Python versions of how the input tensors are referenced in the backward pass. However, the multiplication operator, along with a lot of the other built-in PyTorch operators, is implemented directly in C++, so this notation of save_for_backward and saved_tensors isn’t exactly how it works in the code. It doesn’t actually call the Python written here; it’s used more as a symbolic representation of what’s going on.

If we look at the MulBackward function, we see it has the attribute next_functions, and this is a list of tuples, each one associated with one of the inputs that were passed to this function. AccumulateGrad is associated with tensor a, and None is associated with tensor b. It’s None because b has requires_grad set to False, so we don’t need to pass a gradient to it, and AccumulateGrad is what is used to accumulate the gradient for the a tensor.

So now if we call c.backward(), it starts the backward pass, initializing the gradient as 1 and passing it into MulBackward. It then sees that to get the gradient for a, it has to multiply the incoming gradient by 3. It passes that to AccumulateGrad, which sets the grad attribute on a to the gradient, which in this case is 3. And again, since the b tensor doesn’t require a gradient, it’ll see this None value and won’t pass a gradient along to it.

So now, to show an example of a graph that’s a little bit deeper, we’ll start by setting a to 2 and b to 3, both with requires_grad=True. We’ll then multiply them together to get c. We’ll then create another tensor with the value 4 for d, and then multiply c times d to get e. Now, if we look at this backward graph, we see that this grad_fn points to this node.
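Going back to the single-multiplication example, here is a rough sketch of both ideas from this passage. `MyMul` is a hypothetical Python re-implementation written only for illustration (the built-in multiply is implemented in C++ and does not run this code); it shows the `save_for_backward` / `saved_tensors` pattern. The second half inspects `next_functions` on the built-in operator.

```python
import torch

class MyMul(torch.autograd.Function):
    """Hypothetical Python version of multiply, for illustration only."""

    @staticmethod
    def forward(ctx, x, y):
        # Store the inputs on the context for use in the backward pass.
        ctx.save_for_backward(x, y)
        return x * y

    @staticmethod
    def backward(ctx, grad_output):
        x, y = ctx.saved_tensors
        # d(x*y)/dx = y and d(x*y)/dy = x
        return grad_output * y, grad_output * x

# Built-in multiply: a requires grad, b does not.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0)
c = a * b
print(c.grad_fn)                 # the multiplication's backward node
print(c.grad_fn.next_functions)  # one (node, index) tuple per input
c.backward()                     # starts with an incoming gradient of 1
print(a.grad, b.grad)

# The hypothetical Python version gives the same gradient for a.
a2 = torch.tensor(2.0, requires_grad=True)
c2 = MyMul.apply(a2, b)
c2.backward()
print(a2.grad)
```

The second entry of `next_functions` is `None` because `b` does not require a gradient, so nothing is passed along that edge.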
This grad_fn points to this node, and we could call c.backward() to compute the gradients from here on back, or we could call e.backward() to compute the gradients from here on back. If we look at the next_functions for this MulBackward, we see that the first value points directly to the previous MulBackward. This is because c was the first input to this multiplication function, and c is just an intermediate node. It’s not a leaf, so we don’t need to store a gradient on it, and we can pass the gradient directly to the backward function associated with the function that produced c.

Now, if we call e.backward() here, again we’ll start with an initial gradient of 1, which will be passed to this MulBackward, and we’ll see that the input tensors were 4 and 6, so the gradient for d will be 6 and the gradient for c will be 4. The 4 is passed up to the next MulBackward, and the 6 is passed to the AccumulateGrad function, which then stores this gradient in the tensor d. When this value of 4 is passed into this MulBackward function, it again gets multiplied by the gradients for a and b. For a the gradient will be 3, and for b the gradient will be 2, so we multiply 4 by 3 for a and 4 by 2 for b, and those values get passed to the AccumulateGrad functions, which then set the grad values on the tensors.

In the backward pass, when MulBackward retrieves these saved tensors to look at their values and compute the gradient, it does one additional thing to make sure they haven’t changed since the operation was performed in the forward pass. What it actually does is store a version number with each tensor that’s created, and that number matters any time you perform an in-place operation such as c += 1.
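Before the version counter is explained further, the two-multiplication walk-through above can be reproduced with a short sketch. It uses the values from the video (a = 2, b = 3, d = 4, all requiring grad). The node names printed by `name()` may differ between PyTorch versions.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b                       # intermediate node, value 6
d = torch.tensor(4.0, requires_grad=True)
e = c * d                       # value 24

# First edge goes to the node that produced c, second to d's accumulator.
for node, index in e.grad_fn.next_functions:
    print(node.name(), index)

e.backward()
# Incoming gradient 1 -> grad for c is d = 4, grad for d is c = 6.
# The 4 then flows on: a gets 4 * 3 = 12, b gets 4 * 2 = 8.
print(a.grad, b.grad, d.grad)
```

Only the leaves `a`, `b`, and `d` end up with a stored gradient; `c` just forwards its gradient to the earlier multiplication node.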
This version gets incremented by every such in-place operation, so if we did this in-place operation on c before calling e.backward(), we would get an error when it tried to retrieve the saved tensors, as it would look at c and see that it has a version of 1 now, whereas when it was passed into the multiplication function it had a version of 0. So this is how it prevents those types of errors from occurring. However, if this function were instead something like the add function, where e is c plus d, then the add function doesn’t need to save any tensors for the backward pass, since the gradient is just passed through to the next node in the graph. So in this case, if we did an in-place operation on c and then called e.backward(), no errors would occur, because our backward graph doesn’t depend on knowing the value of this c tensor.

Now, in these next_functions, these lists of tuples, there is a second value, which is 0, and I wanted to explain what that number is used for in this next example. First, we’ll start off with a one-dimensional tensor a with three values: 1, 2, and 3. We’ll then call unbind on this tensor to create b, c, and d. What unbind does is pretty much the opposite of stack: it takes the values along the first dimension and splits them up into a tuple of separate tensors, and here we just unpack that tuple into the values b, c, and d. So b has a value of 1.
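Before moving further into the unbind example, the version-counter behavior described above can be checked with a sketch like the following. It reads `_version`, which is an internal tensor attribute and is used here only for inspection; the helper `build_inputs` is a made-up name. The values match the earlier example, with d requiring grad so that the multiplication has to save c.

```python
import torch

def build_inputs():
    # Hypothetical helper: rebuilds a, b, c, d from the earlier example.
    a = torch.tensor(2.0, requires_grad=True)
    b = torch.tensor(3.0, requires_grad=True)
    c = a * b
    d = torch.tensor(4.0, requires_grad=True)
    return a, b, c, d

# Multiplication saves c, so modifying c in place breaks the backward pass.
a, b, c, d = build_inputs()
e = c * d
version_before = c._version
c += 1                          # in-place operation bumps the version
version_after = c._version
try:
    e.backward()
    mul_error = None
except RuntimeError as err:
    mul_error = err
print(version_before, version_after)
print("mul raised:", mul_error is not None)

# Addition saves nothing, so the same in-place change is harmless.
a, b, c, d = build_inputs()
e = c + d
c += 1
e.backward()
print(a.grad, d.grad)
```

The in-place update is applied to `c`, a non-leaf tensor; PyTorch rejects in-place operations on leaf tensors that require grad, which is a separate check from the one shown here.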
Likewise, c has a value of 2 and d has a value of 3, and all of their grad_fn attributes point to the same UnbindBackward function, which in turn points to a single AccumulateGrad function that will accumulate the gradient for the tensor a. If we then multiply all these values together, b, c, and d, to get e, it will actually multiply b and c together first and then multiply the output of that times d to get e. So this creates two MulBackward functions, and here we can see the use of the second value: 0, 1, and 2. These values are associated with the output index from the unbind function. Since this first value is associated with the b tensor, the 0 is saying that this is the gradient for the first output of the unbind function, this one is saying this is the gradient for the second output, and the third output of the unbind function is here. The reason these index values are needed is that the UnbindBackward function needs to know which output these gradients are for, so that it can pass them along to the next node.

We’ll come back down to the bottom and call e.backward() to simulate the backward pass. We start off with the gradient of 1 coming into this MulBackward function. We then get the gradients 3 and 2. The 2 is passed directly up to the UnbindBackward, and the 3 is passed into this second MulBackward function, which then outputs 6 and 3. So we have 6, 3, and 2 being passed into the UnbindBackward function, and it passes those along to the AccumulateGrad, which then saves them into the a tensor.

So I have one more example, which builds a slightly more complicated graph, and it should show the intricacies of how the autograd system builds that graph. Now, in these examples I’m using mostly scalar values for simplicity, but you can also pass in any vector or matrix or any N-dimensional array. Here we’ll start off with two values, both tensors of value 2, and neither of them requires a gradient. We’ll then multiply them together to get c, which also doesn’t require a gradient.
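Returning to the unbind example above, here is a small sketch that reads the output indices from `next_functions` and runs the backward pass. The names `first_mul`, `bc_edges`, `e_edges`, and `output_indices` are made up for illustration.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b, c, d = a.unbind()            # three scalar tensors: 1, 2, 3
e = b * c * d                   # evaluated as (b * c) * d

# Edges of the last multiply: (node for b * c, 0) and (unbind node, 2).
e_edges = e.grad_fn.next_functions
# Edges of the first multiply: (unbind node, 0) and (unbind node, 1).
first_mul = e_edges[0][0]
bc_edges = first_mul.next_functions

# The second element of each tuple is the unbind output index.
output_indices = [index for _, index in bc_edges] + [e_edges[1][1]]
print(output_indices)

e.backward()
print(a.grad)                   # gradients 6, 3, 2 gathered into one tensor
```

The indices tell the shared unbind backward node which slot of `a.grad` each incoming gradient belongs to.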
We’ll then set c.requires_grad = True to activate this leaf, so that any future operations that use this tensor as an input will start to build the backward graph. We’ll also create another tensor, d, which doesn’t require a gradient. We’ll then multiply them together to get e, and this will start to build the first node in our backward graph.

To explain the colors I chose: brown is for the branches of the tree, even though this is a graph and not exactly a tree structure. Green is for the leaves of the graph. Yellow is for tensors that are also leaves but aren’t on the tree, so they’re kind of like dried-up leaves. And blue feels like a magical color, and automatically calculating the gradients backward through your graph kind of feels like a magical thing that’s happening.

Anyway, next we’ll create another tensor, f, that doesn’t require a gradient, another leaf that isn’t on the graph, and multiply these two values together to get g. Again we can see the graph is only going up the left side of these values. We’ll then create another tensor, h, that does require a gradient, so this is an active leaf of the graph. We’ll then divide g by h to get i, and then add i and h together to get j. Here we can see that h is being passed into this division function and also this addition function, and since it’s being passed into two functions, the AccumulateGrad node for h has two inputs, one from this DivBackward and one from this AddBackward. We’ll then multiply j and i together to get k, and again, since we passed this i tensor into both this add function and this multiply function, we’ll see a convergence of gradients coming up to this DivBackward function. Unlike above, because i isn’t a leaf node, these two streams don’t converge on an AccumulateGrad node; they’re instead passed to the node associated with the operation that created this tensor.
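The graph construction just described can be sketched as follows. The transcript only gives a = b = 2, so the values chosen here for d, f, and h (all 2) are assumptions, as is the helper name `edge_names`. The printed node names may vary by PyTorch version.

```python
import torch

def edge_names(tensor):
    # Hypothetical helper: names of the nodes a tensor's grad_fn points to.
    return [node.name() if node is not None else None
            for node, _ in tensor.grad_fn.next_functions]

a = torch.tensor(2.0)
b = torch.tensor(2.0)
c = a * b                       # leaf, no graph yet
c.requires_grad = True          # "activates" the leaf
d = torch.tensor(2.0)           # assumed value, not on the graph
e = c * d
f = torch.tensor(2.0)           # assumed value, not on the graph
g = e * f
h = torch.tensor(2.0, requires_grad=True)   # assumed value, active leaf
i = g / h
j = i + h
k = j * i

print(c.is_leaf, c.requires_grad, c.grad_fn)
print(edge_names(e))            # accumulator for c, then None for d
print(edge_names(j))            # division node, then accumulator for h
print(edge_names(k))            # add node, then division node again
```

Both `j` and `k` have an edge to the division's backward node, which is the convergence the video describes for the non-leaf tensor `i`.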
So as you’ll see, wherever there’s a split in the forward graph, there’ll be a convergence in the backward graph. Now, at the end, we can call k.backward(), and we’ll start with a gradient of 1 and pass it up through this graph, through all these nodes. Whenever we reach a leaf node, those gradients get accumulated and stored on the grad attribute of that leaf, so this one is set to -64 and, up here, c is set to 36. By default, the gradients only get stored on the leaf nodes, and all intermediate nodes will still have their grad as None. But if you want to save the gradient on an intermediate node, you can call retain_grad() on that tensor, and that will set up a hook that gets called in the backward pass. That basically tells this DivBackward function that any gradients passed into it should also be saved on the grad attribute of this i tensor.

Here we call m = k.detach(). That creates a separate tensor that has the same data as k, and they actually share the same underlying storage. m will no longer require a gradient, it’ll be a leaf node, and its grad_fn will be None, meaning it doesn’t have a reference to this backward graph. The reason we want to do this is that most of the time we want this backward graph to get garbage collected; we don’t want to keep it around longer than the training loop. When we call k.backward(), some values do get freed in the graph, specifically the references to the saved tensors, but the actual graph still exists in memory. So if we want to store this output value for longer than the training loop, we’ll want to detach it from the graph before we do that. We can do that by calling k.detach(), which will give us a tensor. We can also call k.detach().numpy(), which will give us a NumPy array (calling numpy() directly on a tensor that still requires grad raises an error).
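The backward pass, retain_grad, and detach behavior described here can be checked with a short sketch. As before, the values for d, f, and h (all 2) are assumptions; they were picked because they reproduce the -64 and 36 quoted in the video. The sketch also tries the `item()` and `tolist()` conversions covered in this part of the video, using a made-up 2x2 tensor `t`.

```python
import torch

a = torch.tensor(2.0)
b = torch.tensor(2.0)
c = a * b
c.requires_grad = True
d = torch.tensor(2.0)           # assumed value
f = torch.tensor(2.0)           # assumed value
h = torch.tensor(2.0, requires_grad=True)   # assumed value
g = c * d * f
i = g / h
i.retain_grad()                 # keep the gradient on this non-leaf tensor
j = i + h
k = j * i

k.backward()
print(h.grad, c.grad)           # leaf gradients
print(i.grad)                   # stored only because of retain_grad()

m = k.detach()                  # same data, no link to the graph
print(m.requires_grad, m.is_leaf, m.grad_fn)
print(m.data_ptr() == k.data_ptr())   # shares the underlying storage

# Conversions to plain Python values.
print(type(k.item()))           # Python float for a float tensor
t = torch.tensor([[1, 2], [3, 4]])    # made-up integer tensor
print(t.tolist())               # nested lists of Python ints
```

With these values, i is 8 and j is 10, so the gradient retained on `i` is j + i = 18, the sum of the two streams converging on the division node.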
We can also call k.item(), which will give us either a Python int or a Python float, depending on the dtype of the tensor. Or, if our tensor holds more than a single value, we can call tolist() to get a Python list of Python ints or Python floats, and if it’s a multi-dimensional tensor, it will return nested sublists of ints or floats.

So I hope you found that informative. This was my first video related to PyTorch, but I’ve been loving the library and I’ll be learning a lot more about it in the future. So if you want to see more PyTorch-related videos, let me know in the comments, and I’ll see you guys next time.