Transcript:

What is the best loss function? The age-old question in machine learning. Will we solve this problem today? Nope, but we will talk about some loss functions, their pros and cons, and even discuss a recent paper on adaptive loss functions, which would mean you don't need to keep trying out different losses to find the one that best suits your needs. Fun stuff.

OK, first, let's focus on regression. I've got this dataset here. It looks like a line can fit this data, so I'll train a linear regression model on it, and I choose the squared loss, the L2 loss, as the thing to minimize. My curve looks something like this. Okay, looks pretty good, and it's also pretty simple. But if I introduce some outliers into this data, my model responds by freaking the hell out and trying to fit those data points better. This happens because the square term scales up the errors from the outliers, so the model really wants to get these obscure points right.

I'll just change the loss function to the absolute difference, the L1 loss. My model now treats the outliers like any other data point, so it won't go out of its way for outliers if that means compromising the rest of the model. This might lead to poor predictions from time to time, but if you really don't care about the extreme cases, this will do. Support vector regression uses this, by the way.

The advantage of the squared error is the ease with which we can compute the gradient during gradient descent. The gradient is not as simple in the absolute error case because of the point of non-differentiability, so the mean absolute error isn't optimized through plain gradient descent but by computing subgradients instead, which adds a bit more complexity. I'll add some reading material in the description down below.

So we've got two losses: one that loves outliers and another that ignores them. If you think one doesn't work, you'll just use the other, and that might be fine in most cases. But consider this.
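A quick sketch (my own illustration, not code from the video) of how the two losses respond to outliers. The dataset, learning rate, and step count are all made up; the point is that the L1 fit shrugs off the injected outliers while the L2 fit gets dragged toward them. Note how the only difference between the two fits is the per-residual (sub)gradient we plug in:

```python
import random

random.seed(0)
xs = [i * 0.2 for i in range(50)]                      # x in [0, 9.8]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]  # true line: y = 2x + 1
for i in range(3):                                     # inject a few big outliers
    ys[i] += 30.0

def fit_line(xs, ys, grad, lr=0.01, steps=5000):
    """Plain (sub)gradient descent on y = w*x + b.
    `grad` maps a residual r = prediction - target to d(loss)/dr."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            g = grad(w * x + b - y)
            gw += g * x
            gb += g
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Squared (L2) loss: gradient is 2r, so big residuals pull very hard.
w2, b2 = fit_line(xs, ys, lambda r: 2.0 * r)
# Absolute (L1) loss: subgradient is sign(r), so an outlier pulls no
# harder than any other point.
sign = lambda r: (r > 0) - (r < 0)
w1, b1 = fit_line(xs, ys, sign)
```

With the outliers sitting at small x, the L2 intercept gets yanked upward by several units, while the L1 fit stays close to the true line.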
Our data is, like, 70% trending in one direction and 30% in the other. Technically, this data doesn't have any outliers, but our absolute loss may treat the 30% as outliers and ignore them altogether, while the squared loss will try too hard to capture them. Both decisions can lead to poor model performance. How do we compromise? We can do so by using the pseudo-Huber loss. This is the best of both losses: if a data point has a relatively low error, we take something like the squared loss; if the data point is an outlier, we take something like the absolute loss. The result is that it reduces the effect of outliers on the model while still being differentiable, and as such, it's slightly more complex. The main problem here is that we have an extra hyperparameter to play with. These are the most popular regression losses that you'll see built into regressors.

Now for classification losses. In classification, our output is obviously the class, but more precisely it's a list of probabilities of belonging to the different classes, and we just choose the class with the highest probability, because, duh. This list is a probability distribution. We compare it to the ground truth, and how we compare them depends on the loss we use.

So, cross-entropy loss. Entropy has its roots in information theory, so I'll explain it from that perspective. Say there's this weather station, and it sends you a weather forecast at the beginning of each day, telling you what the weather is on that day using some N bits of information. In the best case, this information can be packed into as few as 3 bits on average: 2 bits for a sunny day, 4 bits for a rainy day, 3 bits for a partly cloudy day, and so on. The entropy of a distribution is the average number of bits required to convey a piece of information, like today's weather in this case. So the entropy in this example is three bits. But the tower isn't perfect. It's designed by engineers, who have flaws themselves.
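As a quick aside (my own illustration, not from the video): the entropy calculation in the weather example can be reproduced in code. The weather states and probabilities below are made up, chosen so that the optimal code lengths match the 2/3/4-bit figures and the average works out to exactly three bits:

```python
import math

# Hypothetical weather distribution. Under an optimal code, an event with
# probability p gets a code of length -log2(p) bits, so sunny (common) gets
# a short code and rain/snow (rare) get long ones.
weather = {
    "sunny": 1/4,                                     # 2-bit code
    "partly cloudy": 1/8, "cloudy": 1/8,
    "windy": 1/8, "foggy": 1/8,                       # 3-bit codes
    "drizzle": 1/16, "rain": 1/16,
    "hail": 1/16, "snow": 1/16,                       # 4-bit codes
}

def entropy_bits(dist):
    """Average bits per message under the optimal code: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in dist.values())
```

For this distribution, `entropy_bits(weather)` comes out to exactly 3 bits: no code, however clever, can report the weather in fewer bits on average.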
There is some wastage, and it's found that the tower actually sends you five bits on average. This is cross-entropy: we are comparing the true average to the satellite's current average. The entropy is three bits, but the cross-entropy is five bits. This means we could have had a system that tells us the weather with just three bits, but the system we currently have, our satellite, is using five bits to do the same thing. Ideally, we want these numbers to be much closer to each other. This two-bit difference is known as the KL divergence, or the Kullback-Leibler divergence. This little satellite is actually similar to a model that we train to predict the weather as a classification problem, and so in many classification problems, cross-entropy and KL divergence are often used as the loss functions to minimize.

Another loss is the hinge loss, typically used in support vector machines for classification tasks. Minimizing it, we get a boundary that splits the data well and is as far away from every data point as possible; that is, it maximizes the minimum margin from the data points. This loss penalizes data points that lie inside the margin even if they are correctly labeled. I've made several overly mathematical videos on kernels and SVMs. Check them out if you want to lower your self-esteem.

I'm gonna wrap up this video with a paper discussion. We've taken a rough look at six common losses for classification and regression, but there are far more, some better suited for certain problems. We have a set of points, and we want to fit a regression line through them. The squared loss does it decently well. We add outliers and fit it again; it doesn't look too great anymore. So we try the pseudo-Huber loss, and this gives us better results. But I'm not satisfied yet, so let's try some other losses. The Welsch loss: results are trash. The Geman-McClure loss: it fits this data better. Now the Cauchy loss: this fits the data even better.
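Here's a rough sketch (my own, with made-up numbers) of the classification losses just discussed. Cross-entropy is the average bits you spend when events drawn from the true distribution p are encoded with a code optimized for a model distribution q; the KL divergence is exactly the wasted difference. The hinge loss is included alongside:

```python
import math

def cross_entropy_bits(p, q):
    """Average bits used to encode events from true distribution p
    with a code that is optimal for model distribution q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

def entropy_bits(p):
    """Best achievable average: cross-entropy of p with itself."""
    return cross_entropy_bits(p, p)

def kl_divergence_bits(p, q):
    """The wasted bits: always >= 0, and exactly 0 when q matches p."""
    return cross_entropy_bits(p, q) - entropy_bits(p)

def hinge(y, score):
    """Hinge loss for a true label y in {-1, +1} and a raw classifier score.
    Zero only when the point is correctly classified AND outside the margin."""
    return max(0.0, 1.0 - y * score)

# True distribution over 4 classes vs. a model's prediction (illustrative):
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]
```

With these numbers the entropy is 1.75 bits, the cross-entropy is 2 bits, and the KL divergence is the 0.25-bit gap, mirroring the satellite story. And notice `hinge(1, 0.5)` is positive: a correctly classified point inside the margin still gets penalized.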
I like this; it's nice that I found a loss function I liked, but I found it by trial and error. Is there a way I could have just used a loss function, without trial and error, and somehow arrived at the result I wanted? It turns out that all the losses I mentioned can be generalized into a single equation by setting different values of alpha, which is a shape parameter. How do we bring alpha into the mix, though? Maximum likelihood estimation: we maximize the likelihood of the corresponding probability distribution, or minimize the negative log likelihood, so the loss becomes adaptive. This technique is typically used to derive losses mathematically, and it actually leads to some interesting results. Here are some examples of images generated when we let a variational autoencoder determine its own loss; they aren't half bad. The idea of an adaptive loss sounds like an amazing idea. I hope you all now have a better idea of loss functions, the differences between them, and a sprinkle of research on adaptive loss functions, so we can avoid trial and error when picking the most appropriate loss. I have resources in the description below. If you like these videos, please subscribe to keep the lights on in my little apartment, and I will see you soon. Bye.
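Assuming the paper being discussed is Barron's "A General and Adaptive Robust Loss Function" (the transcript doesn't name it, so this is my inference from the losses listed), the generalized loss can be sketched as below. The shape parameter alpha selects the loss, and c is a scale parameter; alpha = 2, 0, and negative infinity are limits of the main formula, so they're handled as explicit cases:

```python
import math

def general_loss(x, alpha, c=1.0):
    """Generalized robust loss (a sketch following Barron's paper).
    alpha =  2   -> L2-like (0.5 * (x/c)^2)
    alpha =  1   -> pseudo-Huber / Charbonnier shape
    alpha =  0   -> Cauchy loss
    alpha = -2   -> Geman-McClure loss
    alpha -> -inf -> Welsch loss"""
    z = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * z
    if alpha == 0.0:
        return math.log1p(0.5 * z)            # limit as alpha -> 0
    if alpha == float("-inf"):
        return 1.0 - math.exp(-0.5 * z)       # limit as alpha -> -inf
    a = abs(alpha - 2.0)
    return (a / alpha) * ((z / a + 1.0) ** (alpha / 2.0) - 1.0)
```

The smaller alpha gets, the more aggressively large residuals are down-weighted, which is exactly the trial-and-error knob (squared loss, pseudo-Huber, Cauchy, Geman-McClure, Welsch) collapsed into one parameter that maximum likelihood can then fit for you.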