Transcript:

Well, now it's time for us to look at the motivation, the origin, and some intuition behind cross-entropy loss from a likelihood perspective, and we'll also talk about where the log comes from. So let's see where this all begins. What is likelihood? That's not obvious, so we need a little experiment. We're going to set up an experiment of picking candies out of a bag. Yeah, it's good to see an example. So let's do it. Let's build the actual training set: we pick 5 candies, say 2 reds and 3 blues. That's it, basically; this is our training set, we just picked some candies. For example, we have 5 samples from this distribution, 1, 2, 3, 4, and 5: first red, then blue, red, blue, and blue. So now, from the training set, we can calculate the probability of red. Exactly: if you just count, the empirical probability distribution of the training set is P(red) = 2/5, since 2 of the 5 samples are red, and P(blue) = 3/5. And if we somehow make a measure of fit out of this, what would the likelihood be? We just multiply the model's probabilities for all the samples together. So let's say we have a guess about this distribution. A bad guess is something like Q(red) = 4/5 and Q(blue) = 1/5. A good guess, more similar to the training set, could be something like Q(red) = 2.5/5 and Q(blue) = 2.5/5. Okay, so now, what do we do with these numbers?
Are we going to multiply (2/5) squared times (3/5) cubed, and do the same for the other distributions, and see how those numbers come out? Exactly. So if this model is supposed to describe the training set, let's see what its likelihood is. For the bad guess, the likelihood has the factor 4/5, and since red happens 2 times it is squared, and the factor 1/5, and since blue happens 3 times it is cubed. So the bad likelihood is (4/5)^2 × (1/5)^3 = 16/25 × 1/125 = 16/3125. Okay, let's try the good guess. If you assume that model describes the training set, the likelihood would be (2.5/5)^2 × (2.5/5)^3. Yeah, this is really nice, because that's (1/2)^2 times (1/2)^3. Exactly, so it's 1/2^5, which is 1/32, and that's a lot bigger than 16/3125. If you compare these two, the good guess gives the larger likelihood, so it is more likely, and it's closer to the original training set. Well, that's good, so now we have a motivation for the theoretical formula. Let's erase this and move to the next frame. The erasing won't take six hours, so if you're watching, please don't go away; it will just be a couple of minutes. By doing simple numerical calculations you get some intuition about what the likelihood is, so that when you see all the algebra on the next slide you won't want to leave the room or hide your face. It does look a bit scary, but it's really just a symbolic extension of what we have just discussed. We're doing well here. Okay, now here are the formulas. So, Muhammad, will you take me through that first white block, this one? Yeah, so this is exactly what we did on the previous slide.
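The two likelihoods worked out above are easy to check numerically. Here is a minimal sketch in Python; the `likelihood` helper and the dictionary layout are my own illustration, not something from the lecture:

```python
from math import prod

# Training set from the lecture: 2 red candies and 3 blue candies
counts = {"red": 2, "blue": 3}

def likelihood(q):
    """Product of the model's probability for every training sample."""
    return prod(q[color] ** n for color, n in counts.items())

bad_guess = {"red": 4/5, "blue": 1/5}
good_guess = {"red": 2.5/5, "blue": 2.5/5}

print(likelihood(bad_guess))   # (4/5)^2 * (1/5)^3 = 16/3125 ≈ 0.00512
print(likelihood(good_guess))  # (1/2)^2 * (1/2)^3 = 1/32 ≈ 0.03125
```

The good guess wins by a factor of about six, which is exactly the comparison made on the board.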
So these q_i are the probabilities assigned by the model to the different classes, and n·p_i is the number of those samples, the count of class i in the training set. In our case, for q_1, if that is red, n·p_1 would be 2, and then we raise q_i to that power. Exactly, and then the other factor would be the likelihood of the blues. So we have the likelihood function, L = ∏_i q_i^(n·p_i), and now we're going to maximize it. With only two classes the product is easy, but what if we have a hundred and fifty? Then it's going to be complicated to find the derivative of a product of a hundred and fifty factors. Exactly. So basically we have a prediction q_i, and this is the likelihood of the training distribution based on that prediction. We want to maximize it, but it's a product, so you can't differentiate it conveniently; what we can do is change it into something else, and that's where the log comes in. When we take the log of a product, it becomes a sum, and then we take the derivative of a sum, which is much, much easier. The log also tends to compress things, so if we have a large dynamic range, we'll still be able to see that dynamic range. And there's one final property of the log function. Do you want to draw the graph of the log function? What kind of graph is that, where it's always increasing? Monotonic. Monotonic, exactly. So what happens when we apply this transformation to the product? The point is that in the original formula there is some value y_min that is less than or equal to y for all other values. If you take the log, then because of the monotonic behavior, log(y_min) ≤ log(y) will still be true. Exactly: the location of the point that achieves the extremum won't change. So if in the original formula the minimum is at some parameter value, even after taking the log that point won't move.
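This argument can be sketched numerically: sweep a grid of candidate models and confirm that the likelihood and the log-likelihood pick the same winner. The helper names and the grid are my own illustration; the winner also lands on the empirical P(red) = 2/5 from the candy example:

```python
from math import log, prod

# Class counts n*p_i from the candy training set (2 red, 3 blue)
counts = {"red": 2, "blue": 3}

def likelihood(q):
    # L = prod_i q_i^(n*p_i)
    return prod(q[c] ** n for c, n in counts.items())

def log_likelihood(q):
    # log L = sum_i n*p_i * log(q_i): a sum, much easier to differentiate
    return sum(n * log(q[c]) for c, n in counts.items())

# Sweep candidate models q = (p, 1 - p) over a grid
models = [{"red": k / 100, "blue": 1 - k / 100} for k in range(1, 100)]
best_by_L = max(models, key=likelihood)
best_by_logL = max(models, key=log_likelihood)

# log is monotonic, so both objectives select the same model
assert best_by_L == best_by_logL
print(best_by_L["red"])  # 0.4, i.e. the empirical P(red) = 2/5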
Yeah, that's right, because all the log does is compress. You can think of the compression as crunching the curve down, and the minimum point gets crunched down the most, so it stays in exactly the same place. And actually, we only care about this point; that point is the parameter value we use to adjust our model's weights. Yeah, exactly, our model. So that's basically it. We still have to go back to the original cross-entropy formula and look at it. So can you make a comment about this last equation? It's basically the same: if we take the log of the likelihood, it results in this expression, log L = Σ_i n·p_i·log(q_i) = n·Σ_i p_i·log(q_i), which, apart from the factor n and the minus sign, is the cross-entropy H(p, q) = −Σ_i p_i·log(q_i). So basically, maximizing the likelihood is equivalent to minimizing the cross-entropy. Right, okay, great, so now let's see where that takes us. Having converted our likelihood to a log-likelihood and shown that, up to sign and scale, it's the cross-entropy, let's look at our expression from before, the binary case. Exactly. So this is what we had on the previous slide for binary cross-entropy, binary classification. It's actually just an easy plug-in: for P and Q we have two labels, two classes, and we just plug that into the cross-entropy formula to get the binary cross-entropy. Since, if the probability of one class is p, the probability of the other class is 1 − p, that's how we get the binary cross-entropy for 2 classes. That's what we had before, in the example we did earlier, with y_n.
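The identity log L = −n·H(p, q) can be checked directly with the numbers from the candy example. A small sketch, with helper names of my own choosing:

```python
from math import log

n = 5                              # number of training samples
p = {"red": 2/5, "blue": 3/5}      # empirical distribution of the training set
q = {"red": 2.5/5, "blue": 2.5/5}  # the "good guess" model

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i)
    return -sum(p[c] * log(q[c]) for c in p)

def log_likelihood(q):
    # log L = sum_i n*p_i * log(q_i)
    return sum(n * p[c] * log(q[c]) for c in p)

# log L = -n * H(p, q): maximizing L is minimizing the cross-entropy
assert abs(log_likelihood(q) + n * cross_entropy(p, q)) < 1e-12
print(cross_entropy(p, q))  # log(2) ≈ 0.693, since q is uniform over 2 classes
```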
The desired output is y, and the estimate is what we called a, exactly. So there it is, the formula: BCE = −[y·log(a) + (1 − y)·log(1 − a)]. It's just a fantastic result that helps us understand that where the cross-entropy is minimized, the disorder in the system is minimized, and that extremum point coincides with maximizing the likelihood, or equivalently minimizing the cross-entropy. Good work!
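As a closing sketch, the binary cross-entropy for a single sample takes only a few lines of Python. The `eps` clipping is my own addition to guard against log(0); it was not discussed in the lecture:

```python
from math import log

def binary_cross_entropy(y, a, eps=1e-12):
    """BCE = -[y*log(a) + (1-y)*log(1-a)], for label y in {0, 1}
    and estimate a in (0, 1)."""
    a = min(max(a, eps), 1 - eps)  # keep a away from 0 and 1
    return -(y * log(a) + (1 - y) * log(1 - a))

# A confident correct estimate gives low loss; a confident wrong one, high loss
print(binary_cross_entropy(1, 0.9))  # -log(0.9) ≈ 0.105
print(binary_cross_entropy(1, 0.1))  # -log(0.1) ≈ 2.303
```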