Transcript:

[MUSIC] Hello, welcome to my whiteboard series. In this video, we are going to derive the gradients of the softmax and cross-entropy loss functions for classification. So why do we use this combination for classification? Because when we have balanced classes, when the distribution of our classes is uniform, for example, if you have two classes and it’s 50/50, then this combination of cross-entropy and softmax results in a very smooth cost function, which is very easy to optimize, and all optimizers will work with it very well. So let’s write down the definitions of softmax and then cross-entropy.
So how do we define the softmax layer? We define it for every output index i, where i represents the class of the output. We define it with the exponential of the corresponding input, so it’s e to the x_i, and it is normalized by the sum of all the exponentials, using k as the summation index: ŷ_i = e^(x_i) / (sum over k of e^(x_k)). So in order to calculate the activation at index i, we need the exponential of the corresponding input, divided by the sum over all the inputs. That’s why the gradient of the activation at i also depends on the values of the other inputs, and we will see that when we derive the gradients. Okay.
What is the other definition, for the cross-entropy layer? We define the cross-entropy as the negative sum, over every class, of the label, which we write as y without the hat, times the log of the output of the corresponding class: L = -(sum over i of y_i log ŷ_i). Sometimes we also call this the log loss, because there is this log, which makes it smooth to optimize. Okay.
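As a rough sketch of the two definitions above, here is how they might look in plain Python. The function names, the example values, and the max-subtraction step for numerical stability are assumptions of this sketch, not something shown in the video.

```python
import math
from typing import List


def softmax(x: List[float]) -> List[float]:
    """Return y_hat_i = exp(x_i) / sum_k exp(x_k) for every index i."""
    # Subtracting max(x) leaves the result unchanged but avoids overflow
    # in exp for large inputs (an assumption added for this sketch).
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y: List[float], y_hat: List[float]) -> float:
    """Return L = -sum_i y_i * log(y_hat_i)."""
    # Terms with y_i == 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t != 0.0)


x = [1.0, 2.0, 3.0]  # example inputs, chosen arbitrarily
y = [0.0, 0.0, 1.0]  # one-hot label for the third class
y_hat = softmax(x)
loss = cross_entropy(y, y_hat)
print(y_hat, loss)
```

With a one-hot label, the loss reduces to minus the log of the output for the true class, which is why only one term survives in the sum.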
First step, let’s derive the gradient of the softmax layer with respect to an arbitrary input. We want to find the gradient of a chosen output with respect to a chosen input, not necessarily matching, so we need to consider two cases: where the output index is equal to the input index, and where it is not. We will derive these two cases.
First, if the indices i and j are the same, we look at this function and take the derivative of ŷ_i with respect to x_i, because the input that corresponds to this output is x_i; they both have index i, because that is the assumption. Now, what do we do? We use the quotient rule to find the derivative of this fraction. What is the quotient rule? We multiply the derivative of the numerator by the denominator, subtract the numerator times the derivative of the denominator, and divide by the denominator squared. So let’s start with the denominator squared: we have the sum over k of e^(x_k), squared. Then we start the numerator by finding the derivative of e^(x_i) with respect to x_i, which is itself, because the derivative of the exponential is itself. That’s first; now we multiply it by the denominator. Second, we subtract the numerator times the derivative of the denominator. The derivative of the denominator with respect to x_i is again e^(x_i), because this summation goes over all the values of k, from one to the number of classes n, so at some point there will be an index k that matches this i, and the rest will be constants, so their derivatives will be zero. So the derivative of the summation is equal to the derivative of the one term whose index we are differentiating with respect to. So it will again be e^(x_i).
That is because all the other terms, whose index k is not equal to i, are treated as constants and their derivatives are zero. Okay, let’s simplify this. Using the definition of the softmax, we can simplify: we have two of these sums in the denominator, and e^(x_i) is a common factor in the numerator. So e^(x_i) over one of the sums gives us one ŷ_i, and that uses up one of the sums and the square. What remains is the sum minus e^(x_i), divided by the remaining sum. When we divide the sum by the sum we get 1, and e^(x_i) divided by the sum is again the output ŷ_i. So the derivative is ŷ_i times (1 minus ŷ_i). This is the derivative of the softmax when i is equal to j.
Okay, now, what is the case when they are not equal? Again the same thing, we use the quotient rule. The denominator is the same, but here the derivative of the numerator e^(x_i) with respect to x_j, where j and i are not equal, is in fact zero, so the first term is zero for this case. The second term is minus the numerator times the derivative of the denominator with respect to x_j, and, similarly to the previous case, that derivative is equal to e^(x_j). Okay. We can simplify this further as well. We have this negative sign; e^(x_i) with one of the sums gives the output at i, and e^(x_j) with the other sum gives the output at j. So the derivative is minus ŷ_i times ŷ_j. Okay, so these are the two derivatives we will need for deriving the cross-entropy gradient.
Okay, now let’s start with the gradient of the cross-entropy. Let’s start here. We want to calculate its derivative with respect to an arbitrary output, a chosen output. So that was the derivative of the softmax, and this is the derivative of the cross-entropy. Now, what do we do? We need to take the derivative of the loss first with respect to ŷ_i, and then, since we will use backpropagation, we will take the derivative of ŷ_i with respect to the input x.
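The two softmax cases derived above (ŷ_i(1 - ŷ_i) when i equals j, and -ŷ_i ŷ_j otherwise) can be checked numerically. The following is a minimal sketch under assumed names (softmax_jacobian, numerical_jacobian) and arbitrary example inputs; it compares the closed-form entries against central finite differences.

```python
import math
from typing import List


def softmax(x: List[float]) -> List[float]:
    """y_hat_i = exp(x_i) / sum_k exp(x_k), shifted by max(x) for stability."""
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(x: List[float]) -> List[List[float]]:
    """Return J with J[i][j] = d y_hat_i / d x_j, using the two derived cases."""
    y_hat = softmax(x)
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                jac[i][j] = y_hat[i] * (1.0 - y_hat[i])  # case i == j
            else:
                jac[i][j] = -y_hat[i] * y_hat[j]          # case i != j
    return jac


def numerical_jacobian(x: List[float], eps: float = 1e-6) -> List[List[float]]:
    """Approximate the same Jacobian with central finite differences."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        y_plus = softmax(x_plus)
        y_minus = softmax(x_minus)
        for i in range(n):
            jac[i][j] = (y_plus[i] - y_minus[i]) / (2.0 * eps)
    return jac


x = [0.5, -1.0, 2.0]  # arbitrary example inputs
analytic = softmax_jacobian(x)
numeric = numerical_jacobian(x)
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(analytic, numeric)
    for a, b in zip(row_a, row_b)
)
print(max_diff)
```

Each row of this Jacobian sums to zero, since the softmax outputs always sum to one and so a change in one output must be balanced by the others.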
We do this because we want to combine the derivatives of these two layers for backpropagation. So what is the derivative of the loss with respect to ŷ_i? It is simple. The negative sign remains, and the summation also remains for now, since we will need every term when we apply the chain rule; y_i remains because it’s a constant. Now, what is the derivative of the log with respect to ŷ_i? It’s the inverse of ŷ_i, 1 over ŷ_i, from the derivative rule for the natural logarithm. And that’s it. This is the derivative of the cross-entropy, and this index is i as well.
Now, when we want to combine, what we in fact want is the derivative of the loss with respect to one chosen input. We need this for backpropagation: there are no parameters here, in these two layers, to optimize, but there will be parameters in the layers before them, so we need to backpropagate the loss through the cross-entropy layer and then through the softmax layer, from the loss to the outputs and then to the inputs. So how do we do that? We need to combine the two gradients. How do we combine them? We split the summation into two cases: where i is equal to j, and where i is not equal to j. Let’s first take the case where i is not equal to j. What will its gradient be? It will be the same thing, except we multiply by the gradient of ŷ_i with respect to x_j. Because of backpropagation, that is, the chain rule, in order to calculate the gradient of the loss with respect to x_j, we multiply the gradient of the loss with respect to ŷ_i by the gradient of ŷ_i with respect to x_j. Okay, that is the case when they are not equal. And what is the case when they are equal? There is no summation anymore, because there is only one such term, the one where i is equal to j. So it’s minus y_j over ŷ_j, times the gradient of ŷ_j with respect to x_j. We could also write everything with i, because in this case i and j are the same, so we can consider everything as i or everything as j.
I’ll just take everything as j, because this is the case where we assume i is equal to j. Okay, now let’s calculate this and combine. First we need to use the gradients we calculated. What is this gradient, for i not equal to j? We calculated it: it’s minus ŷ_i ŷ_j. And for this one, the case where they are equal, we use the other gradient, which is ŷ_j times (1 minus ŷ_j); everything is j here. Okay, let me change color.
Now we will simplify, and we will arrive at a very elegant and simple equation. We can cancel this ŷ_i with the 1 over ŷ_i, and likewise this ŷ_j with the 1 over ŷ_j. The negatives also cancel in the first term, so there we have only a summation, over the terms where i is not equal to j, of the label y_i multiplied by ŷ_j. That is the first term. For the second term we multiply everything out: minus y_j times (1 minus ŷ_j), everything is j, which gives plus y_j ŷ_j and then minus y_j. Okay, here the trick is to combine again. First we divided the summation into two cases; now we need to pull it back together into one summation. How do we do that? The term y_j ŷ_j is in fact the missing i equals j term of the summation: if you consider j as i, you can put it back into the summation and take the summation over all values of i. So it will be the sum over all i of y_i ŷ_j, minus y_j. Now, another fact is that since the labels are one-hot encoded, their sum over all the classes, of which we have n, is equal to one. We can take ŷ_j out of the summation because it doesn’t have any i in it, so the sum of the remaining part is one, and we are left with only ŷ_j minus y_j, which is the simplest gradient you can think of. Even though the forward-pass computations are quite complicated, the derivatives turn out to be very simple and elegant.
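To make the final result concrete, here is a small sketch that compares the simplified gradient ŷ_j - y_j with a finite-difference estimate of the loss gradient. The helper names (loss_from_inputs, analytic_gradient, numerical_gradient) and the example values are hypothetical, chosen only for illustration.

```python
import math
from typing import List


def softmax(x: List[float]) -> List[float]:
    """y_hat_i = exp(x_i) / sum_k exp(x_k), shifted by max(x) for stability."""
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def loss_from_inputs(x: List[float], y: List[float]) -> float:
    """Cross-entropy of softmax(x) against labels y: -sum_i y_i log(y_hat_i)."""
    y_hat = softmax(x)
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t != 0.0)


def analytic_gradient(x: List[float], y: List[float]) -> List[float]:
    """The simplified result of the derivation: dL/dx_j = y_hat_j - y_j."""
    return [p - t for p, t in zip(softmax(x), y)]


def numerical_gradient(x: List[float], y: List[float], eps: float = 1e-6) -> List[float]:
    """Central finite-difference estimate of dL/dx_j for every j."""
    grad = []
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        diff = loss_from_inputs(x_plus, y) - loss_from_inputs(x_minus, y)
        grad.append(diff / (2.0 * eps))
    return grad


x = [0.2, -0.4, 1.5, 0.0]  # arbitrary example inputs
y = [0.0, 1.0, 0.0, 0.0]   # one-hot label for the second class
analytic = analytic_gradient(x, y)
numeric = numerical_gradient(x, y)
max_diff = max(abs(a - b) for a, b in zip(analytic, numeric))
print(analytic, max_diff)
```

A check like this is a common way to debug a hand-written backward pass: if the two gradients disagree by more than a small tolerance, the derivation or the implementation likely has an index or sign error.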
Okay, that’s it for this gradient. I hope you can use it: this way you can implement your own neural network from scratch for a classification problem, debug it, and use different techniques as you wish. Thank you for watching. Leave your comments below for any questions. And that’s it for this video. [MUSIC]