Sparse Softmax Cross Entropy | Categorical Cross-Entropy Loss Softmax

Matt Yedlin


Welcome to our next video. In this series, you can see that we kind of messed up the front title slide, because we had to keep reminding ourselves of the formulas from last time, so we're just going to leave these formulas up for a moment and talk about them. Last time, we talked about binary classification using the sigmoid activation function and the binary cross-entropy. Mohamed, can you remind us what we did with these two formulas?

So we used the cross-entropy as our loss function. If you have only two classes, some sort of binary classification (is it spam or not, is it a dog or not, just two classes), we use this formula with two classes, which gives this one, the binary cross-entropy. So, for example, p1 will be this value here, which is going to be a probability between zero and one, and the other is its complement. The same goes for the predictions from the model, the q_i: one is y hat, the probability of, say, class one, and the other is the complement, one minus y hat. So that's for two classes.

That gives me an idea. If we have more classes, could we still use it? We don't need to be binary anymore, and I'm just curious whether we can use the formula here for a number of classes greater than two and still use the sigmoid activation function.

Actually, yes. The first thing that pops into our heads, if we have multiple classes, is to just have multiple sigmoids. Well, why don't we just do that for the y hat and then evaluate the cross-entropy? Yeah, these are probabilities; the q_i are here. Yes, but the point is that there is a problem with each one of these sigmoids.
Each sigmoid does give a value between 0 and 1, and it is asymptotic to 1 for large inputs, so that's a problem when we start summing. Exactly: the sum may not add up to 1. But here the q_i should be probabilities, so the sum over the different classes should be 1. So we need to do something to fix the normalization across classes, so that we get probabilities instead of overflowing beyond 1. So what should we do?

Ah, here's a formula that you've presented: softmax. That's a normalization over the different classes. We can see that, with that sum in there, we're going to get outputs between 0 and 1. So basically, z is the input to the last layer, and a is our activation, or prediction, since this is the last layer. Let's start with these values: if there are four classes and every z is 0, the probability of each class is 1 over 4, a quarter.

I will let this simulation run; let's see that for different values. Here we go. Notice that the blue bars for all four classes still add up to 1. Here z2 is big, and so is its activation. And now we change it: z3 is the biggest, and its activation output is 0.8. So it seems to work. Basically, what softmax does is convert these different values, normalizing them to probabilities, and it always gives the maximum probability to the class with the maximum input. So the idea is that the biggest blue bar gives us the predicted classification.

And now I think it's time to show a demonstration of this process. Let's see an example. Sure. The setup is simple. These are the inputs to the last layer, and these are the results after the softmax layer, the activations after applying softmax to this z layer. After this we get our prediction, which is basically the same column for the different classes.
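To make the normalization point concrete, here is a minimal sketch in plain Python comparing independent sigmoids with softmax. The input values in `zs` are hypothetical and are not the ones shown in the video's simulation; the function names are also just illustrative.

```python
import math

def sigmoid(z):
    # Squashes a single value into (0, 1), independently of the other classes.
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Exponentiates every input, then divides by the total so the outputs sum to 1.
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Four classes with all inputs equal to zero: every class gets a quarter.
uniform = softmax([0.0, 0.0, 0.0, 0.0])
print(uniform)

# Hypothetical last-layer inputs where the third class has the largest z.
zs = [1.0, 2.0, 4.0, 0.5]
sigmoid_outputs = [sigmoid(z) for z in zs]
softmax_outputs = softmax(zs)

print(sum(sigmoid_outputs))  # larger than 1, so these are not class probabilities
print(sum(softmax_outputs))  # 1, up to floating-point rounding
print(softmax_outputs.index(max(softmax_outputs)))  # index of the biggest "bar"
```

The sigmoid outputs each lie between 0 and 1 but their sum overshoots 1, while the softmax outputs form a proper distribution whose largest entry sits at the class with the largest input.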
This is the probability for class 1, class 2, class 3, and class 4, and this one is the ground truth, the desired output. For this input, class 3 is the right classification, so we want this number to go up. So how do we do that? We achieve that through learning, through minimizing this loss function, which you've shown here by substituting the values: the y is 0, 0, 1, 0 at the front, and the arguments of the logs are the components of y hat. The cross-entropy is 2.9.

So now we're going to train it, and I'll show the next round. Look at this output first: it is not the desired output, and we can see that we have a high number for the loss function. So we want to push that L down as low as we can, and hopefully, when we do that, the third value of y hat will rise up and agree with our desired output.

So let's have the next round. If, through learning, we reach a z layer like this, it results in an activation, or output, or prediction, like that. Yeah, I can see that now classes 2 and 3 are basically close to each other, only a 10% difference, and that shows in the value of the loss function, or cost function. Yeah, so L is decreasing. Let's go again.

Now this one is a good prediction. The desired prediction is class 3, and the model assigns the highest probability to class 3. We can see the loss function is 0.5, which is the lowest number among these three examples. Ideally we'd want a loss function as close to zero as we can get, but we're never going to get exactly there.

So I have a question for you. We have four classes here: cat, dog, fish, and fig, whatever we choose. What happens if we have a whole bunch of classes, like 10, 20, 30, or more? Because then it may be harder to get to the right one.
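The three training rounds described above can be sketched as follows. The prediction vectors are hypothetical, chosen only so that the losses land near the figures quoted in the video (2.9 at the start, 0.5 at the end); the natural logarithm is an assumption, since the transcript does not state the log base.

```python
import math

def cross_entropy(y_true, y_hat):
    # L = -sum_i y_i * log(y_hat_i); terms where y_i is 0 contribute nothing.
    return -sum(y * math.log(q) for y, q in zip(y_true, y_hat) if y > 0)

def binary_cross_entropy(y, y_hat):
    # Two-class special case from the previous video (assumes 0 < y_hat < 1).
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Ground truth: class 3 is the right classification.
y = [0, 0, 1, 0]

# Hypothetical softmax outputs at three stages of training.
rounds = {
    "early": [0.40, 0.345, 0.055, 0.20],   # class 3 badly underestimated
    "middle": [0.10, 0.40, 0.30, 0.20],    # classes 2 and 3 are close
    "late": [0.10, 0.15, 0.60, 0.15],      # class 3 has the highest probability
}
losses = {name: cross_entropy(y, y_hat) for name, y_hat in rounds.items()}
for name, loss in losses.items():
    print(name, round(loss, 2))

# With two classes, the general formula matches the binary cross-entropy.
two_class = cross_entropy([1, 0], [0.8, 0.2])
binary = binary_cross_entropy(1, 0.8)
print(abs(two_class - binary) < 1e-12)
```

Because the target is one-hot, the loss reduces to minus the log of the probability assigned to the correct class, which is why it falls as the third component of y hat rises.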
So is there a limit, in a practical situation, on the number of classes that we can have in our model? In terms of the formulas, it should be fine. We can have any number of classes. Have you ever tried it with lots of classes yourself? Actually, ImageNet has something like a thousand classes, using softmax at the output with this kind of loss function. Wow, that's amazing!

Well, that's it for us. We've shown a situation with good convergence and given an adequate description in terms of the lead-up from the sigmoid. Let's see where it takes us further.
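As a rough illustration that the same formulas carry over to many classes, here is a sketch with 1000 classes. It uses an integer class label instead of a one-hot vector, which is what the "sparse" variant named in the page title usually refers to; that reading, the helper name `sparse_cross_entropy`, and the input values are all assumptions, since the video does not go into them.

```python
import math

def softmax(zs):
    # Same normalization as before, for any number of classes.
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(label, y_hat):
    # With a one-hot target, only the correct class's term survives,
    # so an integer label is enough to pick it out.
    return -math.log(y_hat[label])

num_classes = 1000
label = 2  # hypothetical index of the correct class

# Untrained case: all inputs equal, so every class gets probability 1/1000.
zs = [0.0] * num_classes
uniform_loss = sparse_cross_entropy(label, softmax(zs))
print(uniform_loss)  # equals log(1000)

# After (hypothetical) training, the correct class has a much larger input.
zs[label] = 10.0
y_hat = softmax(zs)
trained_loss = sparse_cross_entropy(label, y_hat)
print(sum(y_hat), trained_loss)
```

The starting loss grows with the number of classes (it is the log of the class count for a uniform prediction), but the outputs still sum to 1 and the loss still drops toward zero as the correct class's input grows.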
