Transcript:

Hello, people from the future. Welcome to Normalized Nerd. In this video I'm gonna try to answer one question that you've probably thought about at least once while learning machine learning, and the question is: why do we need the binary cross entropy cost function when we have our good old mean squared error cost function? Well, I have a nice visualization too, so let's get started. First of all, I'm gonna define the terms. Trust me, it's gonna make our life a lot easier. Here I will be referring to the ground truth as y and the prediction of our model as ŷ (y hat). Don't worry about the architecture of the model, because everything I'm going to talk about in this video will be based on these ground truths and predictions.

Now, the most common loss function we use is the mean squared error loss, and the formula is very simple: 1/M times the summation of the squared errors, that is, the squared differences between the ground truths and the predictions. Obviously M is the number of samples in our training data. Now I'm gonna simplify this expression a bit, because for the analysis we don't really need M data points. We are only going to consider one data point, so the formula becomes (y − ŷ)². Now comes the second most common cost function, the binary cross entropy loss, and here the formula is a bit tricky, but not hard at all; it just includes a couple of logs. In this case too I'm not gonna consider M data points, I'm just gonna consider one data point, and I'm going to ignore the negative sign at the beginning, because for our analysis we are only interested in the shape of the function. So it becomes y · ln(ŷ) + (1 − y) · ln(1 − ŷ). By the way, ln is just the natural logarithm. Okay, now we all know that mean squared error is used for regression problems and binary cross entropy is used for classification problems. But why is that? Well, there are two main reasons behind it.
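The two per-sample losses just described can be sketched in a few lines of Python. This is a minimal sketch (the function names are mine, not from the video), with the negative sign of the cross entropy kept so the loss comes out positive:

```python
import math

# Per-sample losses: y is the ground truth (0 or 1),
# y_pred is the model's prediction, strictly between 0 and 1.

def squared_error(y, y_pred):
    # (y - y_pred)^2 for a single data point
    return (y - y_pred) ** 2

def binary_cross_entropy(y, y_pred):
    # -[y ln(y_pred) + (1 - y) ln(1 - y_pred)] for a single data point
    return -(y * math.log(y_pred) + (1 - y) * math.log(1 - y_pred))

print(squared_error(0, 0.9))         # ≈ 0.81
print(binary_cross_entropy(0, 0.9))  # ≈ 2.30
```

Note that `binary_cross_entropy` is undefined at y_pred = 0 or 1 (the log blows up), which is exactly the behavior the visualization below relies on.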
The first reason is that when we tackle these machine learning problems from a probabilistic view, mean squared error arises naturally from the linear regression problem and binary cross entropy arises naturally from the logistic regression problem. If you want to see the derivation, I actually have a video on that; the link will be appearing in the cards right now and will also be available in the description. But in this video I'm gonna focus on the second reason, which is that in a classification problem, the mean squared error does not penalize the model for a misclassification as much as it should, whereas the binary cross entropy cost function penalizes the model a lot for a single misclassification. That is exactly what we want. I know it's hard to grasp at first, but let me show you something. Here I'm gonna try to visualize the squared error loss and the cross entropy loss, so let us have the axes first. I'm going to deal specifically with a binary classification problem, so we will have two classes, 0 and 1. Let's see what the cost function becomes when we put the ground truth as 0: it's ŷ², which is obviously a parabola, or rather a section of a parabola. Now let's put y = 1; this time the loss is (1 − ŷ)², which is again a parabola, just shifted a bit. Okay, so now comes the cross entropy loss. Again, I'm gonna first put the ground truth as 0 and then as 1. For y = 0, the function reduces to ln(1 − ŷ). Let's plot it; please notice that this function is not defined at ŷ = 1. Now let's see the second case: this time the function becomes ln(ŷ). The shape of this function is exactly the same as the previous one, but in the opposite direction. Now, please remember that I had omitted the negative sign at the beginning.
So in reality these curves should lie above the x-axis, but that is not too important here. Okay, now that we have visualized our functions, let us assume a scenario where our model has made a mistake. I'm assuming that the ground truth is 0, but my model has predicted 0.9, which is close to 1. So my model has predicted that this sample should lie in class 1, but in reality it should lie in class 0; this is a case of misclassification. Now let's see how the two cost functions handle this scenario. First I'm going to calculate the cost according to the squared error loss: that gives me 0.81. Now let's see the loss as calculated by the cross entropy function: it's about 2.3. The difference is not that much, right? But what we are really interested in is the gradient, because it is the gradient that is going to penalize our model. So let's calculate the gradient of the squared error loss with respect to ŷ at 0.9: it turns out to be 1.8, a fairly low gradient. Now let's calculate the same thing for the cross entropy function. Just look at that: it's 10, more than five times larger than the squared error gradient. Allow me to sprinkle some calculus here. What do I exactly mean when I say penalizing the model? That penalty is nothing but the gradient of the cost with respect to the weights, and using the chain rule we can write dL/dw = (dL/dŷ) · (dŷ/dw). Let me write the same thing for the cross entropy loss. Well, now you can tell which cost function is gonna change the model weights more: obviously it's the cross entropy one. Now, what happens if our model makes a bigger mistake? Well, if you just track the tangent of the blue curve, you will see that as the model makes bigger mistakes, the value of the gradient increases very rapidly. Now you might ask, then why don't we use this cross entropy in the case of a regression problem? The reason is that, in a regression problem,
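The gradient comparison above can be checked numerically. This is a sketch for the ground-truth-0 scenario, where the squared error reduces to ŷ² and the (sign-restored) cross entropy to −ln(1 − ŷ); the function names are mine:

```python
# Gradients of the two per-sample losses with respect to y_pred,
# assuming ground truth y = 0 as in the misclassification scenario.

def grad_squared_error(y_pred):
    # d/dy_pred (y_pred^2) = 2 * y_pred
    return 2 * y_pred

def grad_cross_entropy(y_pred):
    # d/dy_pred (-ln(1 - y_pred)) = 1 / (1 - y_pred)
    return 1 / (1 - y_pred)

for p in (0.9, 0.99, 0.999):
    print(p, grad_squared_error(p), grad_cross_entropy(p))
```

At ŷ = 0.9 the gradients are 1.8 versus 10; as the mistake gets worse, the cross entropy gradient explodes toward infinity, while the squared error gradient never exceeds 2.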
We are dealing with continuous values, right? Even if our regression model gives a slightly different answer from the ground truth, it should not cost the model too much. But in a classification problem, if our model gives a different answer, it should cost the model a lot, because we are dealing with classes, that is, discrete values. So that was all for this video, guys. I hope you now understand why we need binary cross entropy for classification problems and mean squared error for regression problems. If you have any questions, please let me know in the comment section, and if you enjoyed this video, please share it and subscribe to my channel. Stay safe, and thanks for watching.