Transcript:

Hello, my name is Krishna, and welcome to my YouTube channel. Today I'm going to show you a very interesting video, which is on the exploding gradient problem. We will try to understand why and how an exploding gradient problem occurs. I've already discussed the vanishing gradient problem. If you have not seen that, I suggest you have a look at it before going through this particular video.
So let us take a very good example to understand what exactly the exploding gradient problem is. Here I am taking a neural network with two hidden layers and an output layer. So these are my two hidden layers, and this is my output layer. My input features get passed to this first neuron. The first weight that I have assigned is w11 with the suffix 1. Here, you know, the summation of the weights multiplied by the input features happens, and then an activation function like sigmoid gets applied. After the activation function is applied, I assign another weight, which is w21, going to the next neuron, which is in the next hidden layer. Let me call the output of the first neuron O11. That is the input I am providing here, and this O11 gets multiplied by w21. To make it simple, I am just writing the result as z. Now this particular value I'm passing over here. The same operation happens, and similarly we will get the value in the output layer. After I get the output, that is my y hat.
Now when I have y hat, what I do in the next step is pass it to a loss function. As you know, we have to reduce that loss value, and for that I'm using an optimizer. How does an optimizer reduce the loss? By updating the weights in backpropagation. Each and every weight gets updated until the loss value is reduced.
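
The forward pass described above can be sketched in a few lines of Python. This is only an illustration under assumptions that the talk does not spell out: one neuron per layer, a single input feature x, a sigmoid on the output layer as well, a squared-error loss, and the names w31, b1 and b3. The starting weights are arbitrary.

```python
import math


def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w11, b1, w21, b2, w31, b3):
    """One neuron per layer: input -> hidden 1 -> hidden 2 -> output."""
    o11 = sigmoid(w11 * x + b1)   # output of the first hidden neuron
    z = w21 * o11 + b2            # the value the talk calls z
    o21 = sigmoid(z)              # output of the second hidden neuron
    o31 = sigmoid(w31 * o21 + b3) # output layer, this is y hat
    return o11, o21, o31


# Hypothetical input, target and starting weights, chosen only for illustration.
x, y = 1.0, 1.0
o11, o21, y_hat = forward(x, w11=0.5, b1=0.0, w21=0.5, b2=0.0, w31=0.5, b3=0.0)
loss = (y - y_hat) ** 2  # assumed squared-error loss
print(y_hat, loss)
```

The optimizer's job is then to change the weights so that `loss` shrinks and `y_hat` moves toward `y`.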
Once the loss is reduced, you will see that y hat and y, which is your actual value, look similar; they will have nearly the same value. So let us try to understand how the weight updation happens, as I have already discussed in the previous class. Suppose I want to update w11 with the suffix 1. I can write the weight updation formula, which is there on the right-hand side: w11 new is equal to w11 old minus the learning rate multiplied by the derivative of the loss with respect to w11.
Now, a very important thing: when does this exploding gradient problem happen? It happens in this particular derivative term. When I'm calculating the derivative of the loss with respect to w11, I can write it using the chain rule. See, my output O31 depends on O21, O21 is impacted by O11, and O11 is impacted by w11. So if I want to find this derivative, I need to find the derivative through all these values, and for that I'll use a simple chain rule. My chain rule will be: derivative of the loss with respect to w11 is equal to derivative of the loss with respect to O31, multiplied by derivative of O31 divided by derivative of O21, then, since O21 is impacted by O11, multiplied by derivative of O21 divided by derivative of O11, and finally derivative of O11 divided by derivative of w11. This is the chain rule, because if I cancel all the intermediate terms, what remains is derivative of the loss divided by derivative of w11.
Now, let me consider just one particular derivative from this chain, because I will show you how an exploding gradient problem happens, and it does not happen just because of the sigmoid function. The main reason this exploding gradient problem happens is the weights. Now, you may be confused.
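
For reference, the weight update and the chain rule spoken above can be written out as equations. The symbol \eta for the learning rate and L for the loss are notational assumptions, and the leading factor for the loss with respect to O31 is made explicit so that the intermediate terms cancel down to the derivative of the loss with respect to w11.

```latex
\[
  w_{11}^{\text{new}} = w_{11}^{\text{old}} - \eta \, \frac{\partial L}{\partial w_{11}}
\]
\[
  \frac{\partial L}{\partial w_{11}}
    = \frac{\partial L}{\partial O_{31}}
      \cdot \frac{\partial O_{31}}{\partial O_{21}}
      \cdot \frac{\partial O_{21}}{\partial O_{11}}
      \cdot \frac{\partial O_{11}}{\partial w_{11}}
\]
```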
How? Let me show you a very good example. You know that whenever we apply a sigmoid activation function, it transforms the values to between 0 and 1, and the derivative of the sigmoid function ranges between 0 and 0.25. You know that, right? So let us take just this one term and try to compute it. Here I'll write it as derivative of O21 divided by derivative of O11. Let us solve this particular derivative only.
Now, before solving this, you know that I am using the z value over here, because I have written the input to this neuron as z. This z is the function that multiplies w21 and O11. So I write z is equal to w21 multiplied by O11, plus bias 2. Over here I have bias 1, and here I have bias 2. After this, what the neuron does in the second step is apply an activation function. So I am writing O21 in terms of the z value: I take the multiplication of w21 and O11, plus b2, as z, because this is the operation that usually happens inside a neuron, and then I apply an activation function, and this activation function is my sigmoid. The sigmoid is 1 divided by 1 plus e to the power of minus z. This is the activation function that is getting applied.
Now imagine, how can I write this derivative of O21 with respect to O11? You need to understand that I am passing in z, and everything is happening on z, right?
So I can write it like this: derivative of the activation function of z divided by derivative of z. Just focus on this. The denominator is derivative of z because everything is happening on z. And similarly, with the help of the chain rule, I multiply by derivative of z divided by derivative of O11, because O11 is impacting z. So I'm using a simple chain rule for O21 to calculate derivative of O21 with respect to O11. For that, I'm saying that O21 is nothing but the activation function of z, so I am writing derivative of the activation function of z divided by derivative of z, multiplied by derivative of z divided by derivative of O11. Again, this is a simple chain rule.
Now, I know what the derivative of this activation function of z is. I know that this is my sigmoid activation function, and I know that the sigmoid derivative ranges between 0 and 0.25. Focus over here, because this is important to understand. We know that the sigmoid activation function transforms the value to between 0 and 1, and in my previous session I also discussed that the derivative of the sigmoid function always ranges between 0 and 0.25. I don't have much space over here, so I'm just writing this first term as the derivative. So this will be ranging between 0 and 0.25. Now I multiply that by derivative of z divided by derivative of O11. Let me substitute.
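
The claim that the sigmoid derivative stays between 0 and 0.25 can be checked numerically. The sketch below uses the standard identity that the derivative of sigmoid(z) equals sigmoid(z) times (1 minus sigmoid(z)); the function names and the grid of z values are illustrative choices, not something taken from the talk.

```python
import math


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid with respect to z."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Evaluate the derivative on a grid of z values from -10 to 10 in steps of 0.1.
zs = [i / 10.0 for i in range(-100, 101)]
grads = [sigmoid_grad(z) for z in zs]

max_grad = max(grads)  # the peak is reached at z = 0
min_grad = min(grads)  # approaches 0 as |z| grows
print(max_grad, min_grad)
```

The peak sits at z = 0, and the derivative falls toward 0 as z moves away from 0 in either direction.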
What is z? I know it is this particular expression, w21 multiplied by O11 plus b2. When I take its derivative with respect to O11, you see that O11 is present in the first term, and the bias is a constant. By the simple rules of derivatives, all of this reduces to w21: the derivative of O11 with respect to itself becomes 1, and the bias term becomes 0. So I can write it like this.
Now understand: suppose I have 0.25 over here. When will my derivative value be higher? I told you the reason: the derivative value becomes higher because of the weights. Let us consider that my weight over here is initialized as 500. If I multiply 0.25 by 500, my value will be 125. Now just imagine: for this particular term my value is 125. Suppose for this other term I calculate 100, considering my weights are higher. So guys, whenever the weights are higher, that alone leads to the exploding gradient problem. For this particular derivative, my weights were very much higher and I got the value as 100, and here also I got 200. If I multiply these, it will be a larger value.
Now when it is a larger value, consider the weight updation formula, where I am replacing the derivative with this larger value. Suppose my learning rate is 1. If I take the older weight minus this larger value, that may become a very small value, meaning a negative number of large magnitude, and then the new weight and the old weight will differ a lot. When they vary a lot, understand, guys, the gradient descent will never converge. It will be jumping here and there. After each and every epoch of backpropagation, your weights will vary a lot with respect to the older weights, and due to that, you will never converge. You will never come to a point.
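
The arithmetic in this example can be reproduced directly. The sketch below takes the numbers from the talk (a sigmoid derivative of 0.25, a weight of 500, other chain-rule terms of 100 and 200, a learning rate of 1). The old weight value of 0.5 and all variable names are hypothetical, and the sigmoid derivative is simply assumed to sit at its maximum.

```python
# Numbers quoted in the talk.
sigmoid_grad_max = 0.25   # largest possible sigmoid derivative (assumed attained)
w21 = 500.0               # a large initial weight
learning_rate = 1.0

# dO21/dO11 = (d sigmoid(z) / dz) * (dz / dO11) = sigmoid'(z) * w21
d_o21_d_o11 = sigmoid_grad_max * w21

# The other chain-rule terms, taken as 100 and 200 as in the talk.
chain_terms = [100.0, d_o21_d_o11, 200.0]

gradient = 1.0
for term in chain_terms:
    gradient *= term  # the chain rule multiplies the terms together

# Weight update: w_new = w_old - learning_rate * gradient
w_old = 0.5  # hypothetical starting weight
w_new = w_old - learning_rate * gradient

print(d_o21_d_o11, gradient, w_new)
```

A single step moves the weight from 0.5 to a value in the negative millions, which is the jumping around that keeps gradient descent from converging.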
You will never come to a global minimum point, and that is why it is very, very important to understand how the weights are initialized, how the weight initialization should be done. Over here it is not just about sigmoid, because you may use any activation function. But if your weights are higher, you may never converge to the global minimum point. In my upcoming videos, I will also show you how weights actually get initialized.
Just understand: whenever we are trying to find one derivative, if my weights are higher, I usually get a higher value of the derivative. You should know that the derivative of sigmoid is between 0 and 0.25, but because of the weights this derivative value comes out larger. And because of the chain rule, as I create a deep artificial neural network, this derivative with respect to w11 will become a very big number. When it becomes a very big number and I apply it in the weight updation formula, w old and w new will be completely different. There will be a huge gap between them, and because of that, after each and every backpropagation, it will never reach the global minimum point. That is why weight initialization is very, very important, and this was all about the exploding gradient problem.
I hope you liked this particular video. So guys, make sure you subscribe to the channel if you have not already subscribed. Please do let me know if you have any questions by putting your comments in the comment box. I will see you all in the next video. God bless you all. Have a great day ahead. Thank you.
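
To see how depth compounds the effect, here is a rough sketch that multiplies one chain-rule term per layer. It is an idealized illustration, not a real training run: every layer is assumed to contribute the same term, the sigmoid derivative is held at its maximum of 0.25 as in the talk, and the function name and the depth of 10 are hypothetical choices.

```python
def chain_product(weight, depth, sigmoid_grad=0.25):
    """Multiply one term (sigmoid_grad * weight) per layer, as the chain rule does."""
    product = 1.0
    for _ in range(depth):
        product *= sigmoid_grad * weight
    return product


depth = 10  # hypothetical number of layers

# Large weights: each term is 0.25 * 500 = 125, so the product blows up.
exploding = chain_product(weight=500.0, depth=depth)

# Small weights: each term is 0.25 * 1 = 0.25, so the product shrinks instead.
vanishing = chain_product(weight=1.0, depth=depth)

print(exploding, vanishing)
```

Under these assumptions the per-layer term is above 1 for large weights and below 1 for small ones, so the same multiplication either explodes or vanishes depending on how the weights were initialized.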