Transcript:

Hey, what’s going on, everyone? In this video we’re going to discuss a problem that creeps up time and time again during the training process of an artificial neural network. This is the problem of unstable gradients, most popularly referred to as the vanishing gradient problem. So let’s get to it. [music]

All right, what do we already know about gradients as they pertain to neural networks? Well, for one, when we use the word “gradient” by itself, we’re typically referring to the gradient of the loss function with respect to the weights in the network. We also know how this gradient is calculated using backpropagation, which we covered in our earlier videos dedicated solely to backprop. And finally, as we saw in our video that demonstrates how a neural network learns, we know what to do with this gradient after it’s calculated: we update our weights with it. Well, we don’t per se, but stochastic gradient descent does, with the goal of finding the optimal weight for each connection that will minimize the total loss of the network.

So with this understanding, we’re now going to talk about the vanishing gradient problem. First we’ll answer: what the heck is the vanishing gradient problem, anyway? Here we’ll cover the idea conceptually. We’ll then move our discussion to how this problem occurs. Then, with the understanding we’ll have developed up to this point, we’ll discuss the problem of exploding gradients, which we’ll see is actually very similar to the vanishing gradient problem, so we’ll be able to take what we learned about that problem and apply it to this new one.

So, what is the vanishing gradient problem? We’ll first talk about this generally, and then we’ll get into the details in just a few moments. In general, this is a problem that causes major difficulty when training a neural network.
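To make the update step described above concrete, here is a minimal Python sketch of how SGD moves a single weight. The function name and the numbers are purely illustrative assumptions, not anything from the video itself:

```python
# Minimal sketch of one SGD weight update (hypothetical values).
# SGD nudges the weight in the direction opposite the gradient of the
# loss, scaled by a small learning rate.

def sgd_update(weight, gradient, learning_rate=0.01):
    """Return the weight after one stochastic gradient descent step."""
    return weight - learning_rate * gradient

updated = sgd_update(weight=0.5, gradient=2.0)
print(updated)  # 0.5 - 0.01 * 2.0 = 0.48
```

The key point for this discussion is that the size of the move is proportional to the gradient: whatever the gradient is, the weight shifts by the gradient times the learning rate.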
More specifically, this is a problem that involves the weights in the earlier layers of the network. Recall that during training, stochastic gradient descent, or SGD, works to calculate the gradient of the loss with respect to each weight in the network. Now, sometimes (and we’ll speak more about why this is in a bit) the gradient with respect to weights in earlier layers of the network becomes really small, vanishingly small, hence “vanishing gradient.”

Okay, what’s the big deal with a small gradient? Well, once SGD calculates this gradient with respect to a particular weight, it uses this value to update that weight, so the weight gets updated in some way that is proportional to the gradient. If the gradient is vanishingly small, then this update is in turn going to be vanishingly small as well. So if this new updated value of the weight has barely moved from its original value, it’s not really doing much for the network. This change is not going to carry through the network very well to help reduce the loss, because the weight has barely changed at all from where it was before the update occurred. Therefore, this weight becomes kind of stuck, never really updating enough to even get close to its optimal value, which has implications for the remainder of the network to the right of this one weight and impairs the ability of the network to learn.

So now that we know what this problem is, how exactly does it occur? We know from what we learned about backpropagation that the gradient of the loss with respect to any given weight is going to be the product of some derivatives that depend on components residing later in the network. Given this, the earlier in the network a weight lives, the more terms will be needed in that product to get the gradient of the loss with respect to that weight. What happens if the terms in this product, or at least some of them, are small? And by small, we mean less than one.
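The “product of terms” point can be sketched numerically in Python. This is a deliberate simplification, assuming every per-layer term has the same value; 0.25 is chosen because it happens to be the maximum value of the sigmoid’s derivative, but any value below one behaves the same way:

```python
# Sketch: by the chain rule, the gradient for an early weight is a
# product of per-layer derivative terms. If each term is below one,
# the product shrinks geometrically with the number of terms.

def gradient_magnitude(per_layer_term, num_terms):
    """Product of identical per-layer terms (simplified illustration)."""
    return per_layer_term ** num_terms

for depth in (2, 5, 10):
    print(depth, gradient_magnitude(0.25, depth))
# by 10 terms, the product is already below one in a million
```

So a weight that sits ten layers behind the loss can see its gradient shrink by a factor of roughly a million relative to a weight near the output, even though no single term is dramatically small.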
Well, the product of a bunch of numbers less than one is going to give us an even smaller number, right? Okay, cool. So, as we mentioned earlier, we now take this result, the small number, and update our weight with it. Recall that we do this update by first multiplying this number by our learning rate, which is itself a small number, usually ranging between 0.01 and 0.0001. So now the result of this product is an even smaller number. Then we subtract this number from the weight, and the final result of this difference is going to be the value of the updated weight.

Now, think about it: if the gradient that we obtained with respect to this weight was already really small, i.e., vanishing, then by the time we multiply it by the learning rate, the product is going to be even smaller, and so when we subtract this teeny-tiny number from the weight, it’s just barely going to move the weight at all. So essentially, the weight gets into this kind of stuck state, not moving, not “learning,” and therefore not really helping to meet the overall objective of minimizing the loss of the network. And we can see why earlier weights are subject to this problem: as we said, the earlier in the network the weight resides, the more terms are going to be included in the product that calculates the gradient, and the more terms less than one we’re multiplying together, the quicker the gradient is going to vanish.

So now let’s talk about this problem in the opposite direction: not a gradient that vanishes, but rather a gradient that explodes. Think about the conversation we just had about how the vanishing gradient problem occurs with weights early in the network due to a product of at least some relatively small values. Now think about calculating the gradient with respect to the same weight, but instead of really small terms, what if they were large? And by large, we mean greater than one.
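As a quick numeric illustration of terms greater than one, here is a short Python sketch. The per-layer value of 1.5 is an arbitrary assumption chosen only to show the trend:

```python
# Sketch: when each per-layer term exceeds one, the product grows
# geometrically with the number of terms, so gradients for early
# weights blow up instead of vanishing.

def gradient_magnitude(per_layer_term, num_terms):
    """Product of identical per-layer terms (simplified illustration)."""
    return per_layer_term ** num_terms

print(gradient_magnitude(1.5, 10))  # about 57.7
print(gradient_magnitude(1.5, 30))  # about 191751
```

It is the same mechanism as before, just run in reverse: the depth of the product amplifies whatever the typical term size is, whether that is below one or above it.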
Well, if we multiply a bunch of terms together that are all greater than one, we’re going to get something greater than one, and perhaps even a lot greater than one. The same argument holds here that we discussed for the vanishing gradient: the earlier in the network a weight lives, the more terms will be needed in the product we just mentioned, and so the more of these larger-valued terms we have being multiplied together, the larger the gradient is going to be, essentially exploding in size. With this gradient, we go through the same process to proportionally update our weight that we talked about earlier, but this time, instead of barely moving our weight with the update, we’re going to greatly move it, so much so, perhaps, that the optimal value for this weight won’t ever be achieved, because the amount by which the weight gets updated each epoch is just too large, and it continues to move further and further away from its optimal value.

So a main takeaway from this discussion is that vanishing gradients and exploding gradients are really two faces of a more general problem of unstable gradients. This problem was actually a huge barrier to training neural networks in the past, and now we can see why that is. In a later video, we’ll talk about techniques that have been developed to combat this problem. For now, let’s take our discussion to the comments. Let me know what you’re thinking. Thanks for watching, and see you next time. [music]