Transcript:

Welcome back, everyone, to this class, Data Science: Deep Learning in Python, Part 1. In this lecture, we are going to discuss the cost that we want to minimize in our neural network.

Let's start by recalling the simplest cost function possible, the squared error, which you should all recognize from linear regression. It is clear from this expression that the more different the prediction is from the target, the larger our error will be. Thus, it makes sense to want to minimize this error in order to train our model. By the way, now you also understand why we had to learn how our model makes predictions first. It's because in order to find the error, which is the discrepancy between the target and the prediction, we of course must first have the prediction.

As you recall, minimizing the squared error is equivalent to maximizing the log-likelihood where the error of our model is Gaussian distributed. Why is that? Let's consider what happens if each target t_n is a random variable which is Gaussian distributed with mean y_n. The variance of the Gaussian is arbitrary, and it doesn't matter, as you'll soon see, so we'll just say it's sigma squared. Given this information, we can set up the likelihood function as follows: it's the product of each individual Gaussian PDF, where t_n is the random variable, y_n is the mean, and sigma squared is the variance.

It's important to remember that this is a little more advanced than a typical maximum likelihood estimation problem, because y is not meant to be optimized itself. It's just the model prediction. What we actually want to maximize with respect to is the model weights, so we can think of y_n as a function of some set of weights w, and this w is actually what we want to find. In other words, w is the argmax of L.
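Written out, the likelihood just described might look like the following. This is a sketch reconstructed from the spoken description, not copied from the slides: the sample count N and the notation y_n(w), which makes the dependence on the weights explicit, are assumptions.

```latex
L(w) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^{2}}}
       \exp\!\left( -\frac{\bigl(t_n - y_n(w)\bigr)^{2}}{2\sigma^{2}} \right),
\qquad
w^{*} = \arg\max_{w} L(w)
```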
When you see the letter L, you might think of loss, but in this case we actually mean likelihood, which is something we want to maximize rather than minimize. So just keep in mind that the letter L is somewhat ambiguous depending on the context, since both of these words, which kind of mean opposite things, start with the letter L.

Now remember that whenever we're solving maximum likelihood problems, it's more convenient to maximize the log-likelihood rather than the likelihood directly. So let's take the log of L and see what we get. The first step is to apply the log to each term individually. Since the log and the exponential are inverse functions of each other, they cancel each other out. What we're left with is essentially a function that looks like this, where C means constant: one constant minus a second, positive constant times the sum over n of the squared difference between t_n and y_n. Now, what's interesting about this is that constants don't actually matter. They might yield a different value for the actual likelihood itself, but the likelihood value is irrelevant, since what we care about are the weights w. We just want to maximize L no matter what the scale is. In other words, it doesn't matter what these constants are: the value of w that maximizes the function will be the same. And because the squared error enters with a minus sign, that is the same w that minimizes the squared error. So we've shown that maximizing the likelihood is equivalent to minimizing the squared error.

Let's now discuss a slightly more complicated scenario, which is binary classification. As you recall, for binary classification we use the binary cross-entropy error function, which is as follows: it's the negative of the sum over n of t_n times log of y_n, plus 1 minus t_n times log of 1 minus y_n. The question now is, what's the log-likelihood equivalent of the binary cross-entropy? Well, it turns out that this corresponds to a Bernoulli distributed random variable. In this case, t_n is again the random variable, which is Bernoulli distributed, and y_n is the mean of the Bernoulli distribution, as before.
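The link between the Bernoulli likelihood and the binary cross-entropy can be checked numerically. The following is a minimal sketch, not code from the course: the function names and the small set of targets and predictions are made up for illustration.

```python
import math


def bernoulli_likelihood(t, y):
    """Product over n of y_n^t_n * (1 - y_n)^(1 - t_n)."""
    likelihood = 1.0
    for t_n, y_n in zip(t, y):
        likelihood *= (y_n ** t_n) * ((1.0 - y_n) ** (1 - t_n))
    return likelihood


def binary_cross_entropy(t, y):
    """Negative sum over n of t_n*log(y_n) + (1 - t_n)*log(1 - y_n)."""
    return -sum(
        t_n * math.log(y_n) + (1 - t_n) * math.log(1.0 - y_n)
        for t_n, y_n in zip(t, y)
    )


# Hypothetical data: binary targets and the model's predicted P(t_n = 1).
targets = [1, 0, 1, 1, 0]
predictions = [0.9, 0.2, 0.7, 0.6, 0.1]

# Take the log of the likelihood and negate it.
nll = -math.log(bernoulli_likelihood(targets, predictions))
bce = binary_cross_entropy(targets, predictions)

# The two numbers agree up to floating-point rounding.
print(nll, bce)
```

In practice you would sum the logs directly, as `binary_cross_entropy` does, rather than multiply many probabilities together, because the raw product underflows quickly as the number of samples grows.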
If you want to prove that this is true, you can just write down the likelihood, which is the product over n of y_n to the power of t_n times 1 minus y_n to the power of 1 minus t_n. I think it's almost immediately obvious that when you take the log of this and negate it, you get back the cross-entropy error function which we just saw. And so what we realize is that in both cases, for both regression and binary classification, our cost function or loss function is actually the negative log-likelihood. And by the way, keep in mind I'm just using the letter L for everything here, so don't equate them to each other across different slides.

Finally, we have the scenario for this course, which is multi-class classification. With the Bernoulli random variable, where you only had two choices, the sort of real-life interpretation is a coin toss. A coin toss can only give you heads or tails. When you have multiple possibilities, the random variable comes from a categorical distribution instead. The real-life interpretation is a die roll: when you roll a die, the result must be a number between 1 and 6 inclusive.

If you want, you can do a simple example of solving the maximum likelihood problem for an actual, possibly biased die. In this scenario, you would have a series of die rolls, for example 1, 1, 6, 3, 2, 5, 4, 3. Importantly, it's convenient to represent these with an indicator matrix, which we'll call t_nk. t_nk would be set to 1 if we rolled k on the nth roll; otherwise it would be 0. You might also want to think of this as a one-hot encoded matrix, since for each of the N rows, only one value can be 1 and the rest must be 0. By the way, as a side note, please remember that when we discuss math, our indices usually start counting from 1, but when we discuss programming, our indices usually start counting from 0. So that's just something to keep in mind and be aware of.

Now, because this is a die, we don't have any model predictions, so there are no y's. Instead, we just have the probability of rolling each of the K values. Let's call them
w_1, w_2, up to w_6. Then our likelihood is written as follows: it's the product over all n and the product over all k of w_k to the power of t_nk. By the way, it's good to remember that the likelihood is just the product of PMFs or PDFs for each of the N data points. So if you are not familiar with the PMF of the categorical distribution, please check out Wikipedia to remind yourself.

From here, it's easy to see how we would apply this to a neural network. Our output probabilities y_nk tell us our prediction for each of the targets t_nk. So if y_nk is large, then that means t_nk is more likely to be 1. In other words, we can say t_nk follows the categorical distribution given the probabilities y_nk. So this is the same as our die roll example, except we've replaced the number 6 with a more generic big K, and the probability for each category is y_nk rather than just a fixed w_k.

As usual, our loss is the negative log-likelihood, and so we can say that the categorical cross-entropy loss function is nothing but the negative log-likelihood given that the distribution of the targets is categorical. As mentioned previously, however, maximizing the log-likelihood is exactly the same as minimizing the negative log-likelihood. And so I find it useful to just get rid of the negative sign completely, since we'd just end up carrying it over on each line of our derivation, which gets tedious and is basically redundant.

As a final note, let's try to build some intuition to make sure the loss function does what we actually want it to do. Remember, if y_nk is very wrong, we want the loss to be large, but if y_nk is very close to the target, meaning it's very right, then we want it to be small. Let's assume we're working with only one sample, so there's no index n. Then our loss is just the negative sum over k of t_k times log of y_k.
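For reference, the spoken formulas above can be written out as follows. This is a sketch based on the description in the lecture, not a copy of the slides. The symbol J for the loss is an assumption, introduced only to keep it apart from the likelihood L, and N and K stand for the number of samples and the number of classes.

```latex
% Die roll: fixed probabilities w_k, indicator targets t_{nk}
L = \prod_{n=1}^{N} \prod_{k=1}^{6} w_k^{\,t_{nk}}

% Neural network: per-sample output probabilities y_{nk}, K classes
L = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{\,t_{nk}}

% Categorical cross-entropy = negative log-likelihood
J = -\log L = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \log y_{nk}

% Single sample (no index n)
J = -\sum_{k=1}^{K} t_k \log y_k
```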
Now, suppose we are exactly right. Then for the k where t_k is 1, we also have y_k is 1. Log of 1 is 0, so the total loss would be 0, which makes sense. Therefore, if we are perfectly right, then our minimum loss is 0. Let's say now that we only give 50% probability to the correct target. Then we have 1 times log of 0.5, negated, which is 0.693. Now let's say we only give 25% probability to the correct target. Then we have 1 times log of 0.25, negated, which is 1.386. Finally, let's say we are completely wrong: we give 0% probability to the correct target. Then we have 1 times log of 0, negated, which is infinity. In other words, this is very similar to the squared error, where the worst value is infinity and the best value is 0. Therefore, it seems that this loss function works as expected.

Let's recap what we've done so far and remember what our plan is. As you recall, we have two steps. Step number one is to define our loss. We just did that in this lecture. Step number two is to minimize the loss with respect to our neural network weights. This is going to involve gradient descent or gradient ascent, and therefore the next few lectures will involve finding out what these gradients are.
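The worked values from the intuition check above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the course: the function name, the three-class target, and the way the leftover probability is spread over the wrong classes are all made up for illustration. The natural log is assumed, which matches the 0.693 and 1.386 figures quoted in the lecture.

```python
import math


def categorical_cross_entropy(t, y):
    """Single-sample loss: negative sum over k of t_k * log(y_k)."""
    loss = 0.0
    for t_k, y_k in zip(t, y):
        if t_k == 0:
            continue  # a zero target contributes nothing to the sum
        if y_k == 0:
            return math.inf  # negated log of 0: no probability on the correct class
        loss -= t_k * math.log(y_k)
    return loss


# Hypothetical one-hot target: the correct class is the middle one.
target = [0, 1, 0]

exactly_right = categorical_cross_entropy(target, [0.0, 1.0, 0.0])
half_right = categorical_cross_entropy(target, [0.25, 0.5, 0.25])
quarter_right = categorical_cross_entropy(target, [0.5, 0.25, 0.25])
completely_wrong = categorical_cross_entropy(target, [0.5, 0.0, 0.5])

print(exactly_right)            # 0.0
print(round(half_right, 3))     # 0.693
print(round(quarter_right, 3))  # 1.386
print(completely_wrong)         # inf
```

The explicit check for a zero probability is needed because `math.log(0)` raises a `ValueError` in Python instead of returning negative infinity.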