Transcript:

Welcome to our discussion of different types of activation functions. Today we are going to look at binary cross-entropy. So let’s begin, and I’ll give this to Mohamed. Mohamed, take us away here. Sure. So basically, we have the same setup. We have the input, the output, the activation function sigma, and we have a desired output, so the same basic setup. You have a b now; we didn’t have b before. Yeah, actually, that’s an easy extension. Now we also have the bias term, which will be added to w x, and we have the same mean squared error. Let’s see how it works if you use mean squared error to minimize the distance between the desired output and the predicted output. Yeah, and sigma is our standard sigmoid function that we’ve looked at earlier. So let’s run this and see what happens to the cost as we do gradient descent. Holy smoke, look at that. The desired output is zero, and it goes to 0.09. Well, that’s pretty good.
What would you like to do next? I think you were very lucky, since we were already near the desired values for the weights and bias. Let’s see what happens if they start a bit further from the desired values. So we’re going to go far away. Yeah, from the moon somewhere. So we are going to change b start and w start, OK? Let’s do that. Now we’re pretty far away: we’re at two and two exactly. So let’s see what happens. It doesn’t look good. Wow, after 300 iterations we’re nowhere near where we were before, and it seems like where you start is important. Yes, and this is with basically one neuron, and we have 50 million. This is a problem, so we’d better figure something out to fix it. Yeah, let’s see what the reason for this problem is. Yeah, why does it go so slowly there?
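
The experiment described above can be sketched roughly as follows. This is a minimal sketch, not the notebook used in the video: the input `x = 1.0`, the learning rate `0.15`, the "near" starting values `w = 0.6, b = 0.9`, the factor 0.5 in the cost, and the helper name `train_mse` are all assumptions made for illustration. Only the desired output of zero, the far start at two and two, and the 300 iterations come from the discussion.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def train_mse(w, b, x=1.0, y=0.0, lr=0.15, steps=300):
    """Gradient descent on C = 0.5 * (a - y)**2 for one sigmoid neuron.

    x, lr and the 0.5 factor are illustrative assumptions.
    Returns the final output and the cost recorded at every step.
    """
    costs = []
    for _ in range(steps):
        a = sigmoid(w * x + b)
        costs.append(0.5 * (a - y) ** 2)
        # Chain rule: dC/dz = (a - y) * sigma'(z), with sigma'(z) = a * (1 - a)
        delta = (a - y) * a * (1.0 - a)
        w -= lr * delta * x   # dC/dw = delta * x
        b -= lr * delta       # dC/db = delta
    return sigmoid(w * x + b), costs

near_out, near_costs = train_mse(w=0.6, b=0.9)   # assumed "near" start
far_out, far_costs = train_mse(w=2.0, b=2.0)     # "two and two" from the talk

print("near start: final output =", round(near_out, 3))
print("far start:  final output =", round(far_out, 3))
print("cost drop over first 50 steps (near):", near_costs[0] - near_costs[50])
print("cost drop over first 50 steps (far): ", far_costs[0] - far_costs[50])
```

Under these assumptions, the far start barely reduces the cost during the early iterations, which is the flat stretch of the cost curve that the speakers go on to explain.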
It’s like a flat line here before going down. Okay, let’s go and see why. Basically we are using gradient descent, so we should look at the gradients, right? And if we take the gradient of C with respect to w here, this is the function that results, so we have the sigma prime. We didn’t have it before. And sigma is a function that has basically two asymptotes, right? So that’s when we’re stuck on the asymptotes. The red curve, I guess, must be the derivative curve. Correct. So basically, this is the sigmoid function: the blue one is sigma, and the red one is the derivative of sigma. And as we can see here, the partial of C with respect to w is dependent on sigma prime of z. So if you start at a location which is far away from the real answer, we would get small gradients, so there is no training. So there’s no feeling about where to go, basically, because sigma prime is almost zero. Exactly. Since you’re on a flat surface, you don’t know where to go.
So this is a problem where we had a bad starting value and we didn’t get enough gradient to push us in the right place. So we either try to find a way to get a better starting value, or maybe we need to change the activation function in some way, and I think let’s try that one now. So basically, you should find a way to get rid of this sigma prime; this is the problem. We don’t want to have this sigma prime from the activation function here. So let’s see what’s next. Yes, so what if you use this function? It’s so complicated. Maybe not that complicated, but look. I understand squared error; we use that for fitting lines. But this one has logs. y is the desired output and a is the activation, which depends on z, which depends on our parameters. We have at least three chain rules to do. Thank goodness for automatic differentiation, because I never want to do this for 50 million parameters. So take us through this.
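
The slide with the squared-error gradient is not reproduced in the transcript, so the following is a reconstruction under assumptions: a single training example with input $x$, pre-activation $z = wx + b$, activation $a = \sigma(z)$, and cost $C = \tfrac{1}{2}(a - y)^2$ (the factor $\tfrac{1}{2}$ is a convention chosen here). With those assumptions, the chain rule gives the $\sigma'(z)$ factor that the speakers point to:

```latex
\[
\frac{\partial C}{\partial w}
  = \frac{\partial C}{\partial a}\,
    \frac{\partial a}{\partial z}\,
    \frac{\partial z}{\partial w}
  = (a - y)\,\sigma'(z)\,x,
\qquad
\frac{\partial C}{\partial b} = (a - y)\,\sigma'(z),
\]
\[
\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr).
\]
```

As an illustration, if the input were $x = 1$ (an assumption), the start at $w = b = 2$ gives $z = 4$, so $\sigma(4) \approx 0.982$ and $\sigma'(4) \approx 0.018$. The error $(a - y)$ is close to 1, yet the gradient is scaled down to roughly 2% of it.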
What’s going on here? Yeah, let’s get more feeling about that. First, notice that this a is the output of sigma, so it will always be between zero and one, and never exactly zero or one. That’s one point. And we want to make this a, which is our prediction, close to the ground truth, close to the desired output, right? That’s our goal. So see, for example, what happens if y is zero. This part cancels out, because the log is either a log of a or a log of one minus a, and in this case log of a is bounded. So if y is zero, y log a is zero. Exactly, so we would be left with this part. Okay, y is zero, this part. What do you want? Yeah, so in order to minimize this loss function, we minimize minus ln of 1 minus a, so a should be zero. Oh, okay, it should be close to zero. So when y equals zero, a will be close to zero.
How about the other side? The other side is also easy. y is one, so this part cancels out. y is 1, so 1 minus y is 0 and that part goes away. And y is 1, and then log a: if a is 1, or near one, then that’s near zero, so 1 times 0 is 0 and the cost function is minimized. Exactly. OK, good. So shall we try it? Let’s see what happens. Oh, by the way, that’s sort of like an analysis of numbers, right? But later on, we’ll try to understand where this formula actually comes from. So basically in future videos we’ll talk about the origin of this function, but for now, let’s just take it as is and see it working, OK?
So what do we care about? We care about the gradients, so let’s take the gradient of this function and see what happens. That’s an exercise left for students; it’s just the straight, simple chain rule to get the gradients. You can see the results, and those are the results that we’re going to use in our gradient descent. So when we use those formulas, let’s see what happens. Yeah, so basically, if you take the gradient, it should be easy to follow.
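
For reference, the cost function being discussed and the chain-rule exercise can be written out as follows. The slides are not in the transcript, so this uses the standard form of binary cross-entropy for a single example, with $z = wx + b$ and $a = \sigma(z)$ assumed as the setup:

```latex
\[
C = -\bigl[\, y \ln a + (1 - y)\ln(1 - a) \,\bigr]
\]
\[
\frac{\partial C}{\partial a}
  = -\frac{y}{a} + \frac{1 - y}{1 - a}
  = \frac{a - y}{a(1 - a)},
\qquad
\frac{\partial a}{\partial z} = \sigma'(z) = a(1 - a),
\]
\[
\frac{\partial C}{\partial z}
  = \frac{a - y}{a(1 - a)} \cdot a(1 - a)
  = a - y,
\qquad
\frac{\partial C}{\partial w} = (a - y)\,x,
\qquad
\frac{\partial C}{\partial b} = a - y.
\]
```

The two cases discussed above follow directly: for $y = 0$ the cost reduces to $-\ln(1 - a)$, which is smallest as $a \to 0$, and for $y = 1$ it reduces to $-\ln a$, which is smallest as $a \to 1$. The factor $a(1 - a)$ produced by the sigmoid cancels against the denominator of $\partial C / \partial a$, so the gradient depends only on the error $a - y$.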
And here’s the magic: there is no sigma prime of z here, right? There’s no sigma prime in the final gradient formula at the bottom of the stack of calculations. Good, so it looks like we fixed the problem, but the proof of the pudding is in the eating, so let’s eat the pudding in the next slide. There’s our model, and I’m going to run it and see what happens. So basically the same setup, same values. And if you are close to the desired values, the good b and w, the cost goes down. Yeah, it was the same for MSE. The thing is, though, for MSE the final output was 0.09, so slightly better. Okay, so that’s when we’re close.
Now we’re going to go and put b and w equal to what we used before, where we remember that the cost function was very flat, it took a lot of iterations, and the final answer wasn’t that great. So we change these values to make them further from the real answer. Make them further, okay. I’m going to rerun it. Holy smokes, look at that. Yeah, it’s like 0-5, almost the same, and I think it also ran for three hundred iterations. Wow.
So this is the kind of cost function we want to use in what kind of situation? Because it sounds like it’s a binary situation. So it’s going to be used for classification, binary classification. Exactly, that’s right. So, so far, the problem with initialization is somehow solved. We have a better cost minimization here with this formula, binary cross-entropy. But if we have lots of classes to classify, we’ll have to modify this in some way, which we’ll see in future videos. Okay, well, that’s a great job, and I guess now, because of the way things are, we don’t shake hands anymore. We found homes.
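
The comparison run in this part of the video can be sketched as below. As before, this is an illustrative sketch rather than the code shown on screen: the input `x = 1.0`, the learning rate `0.15`, the 0.5 factor in the squared-error cost, and the helper name `train` are assumptions. The far start at `w = 2, b = 2`, the desired output of zero, and the 300 iterations are taken from the discussion.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def train(cost, w=2.0, b=2.0, x=1.0, y=0.0, lr=0.15, steps=300):
    """Train one sigmoid neuron by gradient descent and return its final output.

    cost: "mse" for C = 0.5 * (a - y)**2,
          "bce" for C = -(y*ln(a) + (1 - y)*ln(1 - a)).
    x and lr are illustrative assumptions, not values from the video.
    """
    for _ in range(steps):
        a = sigmoid(w * x + b)
        if cost == "mse":
            delta = (a - y) * a * (1.0 - a)   # dC/dz keeps sigma'(z) = a*(1-a)
        elif cost == "bce":
            delta = a - y                     # sigma'(z) has cancelled out
        else:
            raise ValueError("cost must be 'mse' or 'bce'")
        w -= lr * delta * x
        b -= lr * delta
    return sigmoid(w * x + b)

# Size of the very first gradient step at the far starting point (z = 4)
a0 = sigmoid(2.0 * 1.0 + 2.0)
print("initial dC/dz, MSE:", a0 * a0 * (1.0 - a0))
print("initial dC/dz, BCE:", a0)

out_mse = train("mse")
out_bce = train("bce")
print("final output with MSE:", round(out_mse, 3))
print("final output with BCE:", round(out_bce, 3))
```

With these assumed settings, the cross-entropy run ends much closer to the desired output of zero than the squared-error run from the same far starting point, because its first steps are not shrunk by the sigmoid derivative.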