Transcript:

In machine learning, classification problems are very common. The input is a set of features and the output is a continuous value between 0 and 1 that we interpret as a probability, but in reality these values aren’t necessarily reflective of true probabilities. A main question here is: how does that even happen? The primary reason is data imbalance. In problems like fraud detection, where we have only a few positive samples, it makes sense to undersample the negative class or overweight the positive class when we train a model. We do this so that the model can detect these positive instances, but as a result the probabilities returned by the model may be skewed higher. To make sure the values reflect true probabilities, we need to calibrate the model, and we’ll see this in code later on too.

Now, how do we actually calibrate a model? Well, we train a model as we usually would for a classification problem. We then take its output and use it to train a calibration model with one feature, the original model’s predicted probability; the corresponding label is the actual label used for the original model too. There are two main types of calibration methods that are widely used: Platt scaling and isotonic regression. Platt scaling basically uses a logistic regressor and is better suited to simpler cases, whereas isotonic regression fits a more flexible, non-decreasing piecewise model and can be used to find more complex relationships. You can try either method and see which works better, depending on your model and your data.

Now, with that primer out of the way, let’s get into some code. Alrighty, so let’s take a look at some code. I opened up a notebook, installed scikit-learn, and imported a bunch of packages from scikit-learn. First of all we have make_classification, which is used to create our classification data set. I’m just going to be creating dummy
data. CalibratedClassifierCV actually performs, behind the scenes, the calibration that I discussed previously. calibration_curve is a good way to visualize whether, or exactly how well, a model is calibrated. train_test_split is used to split your data into training and test sets. LogisticRegression is the main model that we’re going to be using for our dummy data. roc_auc_score is a metric that quantifies how well the model is performing. brier_score_loss is kind of like calibration_curve: where calibration_curve is good for seeing how well a model is calibrated visually via a graph, brier_score_loss is good for determining how well a model is calibrated via just a number. And then we have a bunch of other common functions right over here.

So let’s take the first case, where we create a classification data set with 10,000 samples. It’s a balanced data set, which means there’s an equal number of positive and negative labels, and in this data set we’re going to have 10 features. All of them are going to be significant; they’re going to be important features. In this cell I’m splitting the data into train, dev, and test sets with an 80/10/10 split. We’re using the train set to train the model; the dev set, X_val and y_val, for calibrating the model; and the test set for actually testing the model and getting these results. Just looking at the distribution of the labels in the train set, you can see they’re pretty even. Since there are ten thousand samples, that’s eight thousand in the train set, one thousand in the validation set, and one thousand in the test set. And in the eight thousand we have about four thousand and four thousand, which is about right: fifty-fifty. So, okay, let’s first
consider the uncalibrated model case. Right now we’re just going to fit a logistic regression model on the training data and then make some predictions; y_pred will basically be a list of probabilities. If I look at the distribution of probabilities, you can see that half of them are below 0.47 and half of them are above 0.47, which kind of makes sense. In this case the ROC AUC is about 94.2 percent. Pretty good model; we’ll roll with it. The Brier score loss is 0.0922. Mathematically, the Brier score takes the difference between the test label and the predicted probability, squares it, and then it’s just the average of those squares. So as you can see, if it’s lower, that means it’s better, basically.

This little plot is the calibration curve. Like I said before, this plot just signifies how well a model is calibrated. Ideally it should be very similar to y = x. I’ll just explain what probability of positives and fraction of positives are. The x-axis is the value of the predicted probability, and the y-axis is what percentage of the labels at that predicted probability are actually positive labels; ideally the two should be equal.

A good way to understand what calibration_curve really does behind the scenes is just to open its implementation in the GitHub repo, which I have right over here. Right now we passed in bins equal to 10, and the default is five. What it’s going to do is take the entire range from zero to one and segment it into ten equal parts: zero to 0.1 is one bin, 0.1 to 0.2 is another bin, and so on. Let me actually put that up right here in the code. That happens right over here on line 875, where we’re just creating equal bins.
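As a rough illustration of the two diagnostics described above, here is a minimal pure-Python sketch of the Brier score and of equal-width binning for a calibration curve. The function names `brier_score` and `calibration_bins` and the toy data are made up for this example, and the bin-edge handling is simplified, so it may not match scikit-learn’s `calibration_curve` exactly for probabilities that sit right on a bin boundary.

```python
def brier_score(y_true, y_prob):
    """Average of the squared differences between 0/1 labels and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_bins(y_true, y_prob, n_bins=10):
    """Return (fraction of positives, mean predicted probability) for each non-empty bin."""
    label_sums = [0.0] * n_bins
    prob_sums = [0.0] * n_bins
    counts = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        # Equal-width bins over [0, 1]; a probability of exactly 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        label_sums[idx] += y
        prob_sums[idx] += p
        counts[idx] += 1
    # Empty bins are skipped so we never divide by zero.
    prob_true = [label_sums[i] / counts[i] for i in range(n_bins) if counts[i]]
    prob_pred = [prob_sums[i] / counts[i] for i in range(n_bins) if counts[i]]
    return prob_true, prob_pred


# Toy labels and predicted probabilities (illustrative only, not the notebook's data).
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.05, 0.15, 0.85, 0.95, 0.55, 0.45]

score = brier_score(y_true, y_prob)
prob_true, prob_pred = calibration_bins(y_true, y_prob, n_bins=2)
print(score, prob_true, prob_pred)
```

For a well-calibrated model, each entry of `prob_true` should be close to the matching entry of `prob_pred`, which is what the y = x line on the plot expresses.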
And then what we do later is take all of the 1,000 validation examples and put each one into the bin where its probability lies. So we have 1,000 samples: if the probability lies between 0 and 0.1, put it in the first bin; if it lies between 0.1 and 0.2, put it in the second bin. And then we compute whatever was on the x- and y-axes. prob_true right here is basically saying, for every single bin, what fraction of the samples were of the positive class; we compute that. And prob_pred is, for every single bin, the average of the probability values returned by the model for the samples in that bin. In every case they should be as close to each other as possible, which is why this ideally should be a straight line, right? And when we actually look here, it does look pretty straight. It’s almost y = x, which means that the values returned by this logistic regression model over here are pretty good. They are representative of probabilities, or pretty close to that.

All right, so now that we have the uncalibrated model set, what happens if we calibrate the model on this balanced data set? Basically, we take clf, which is the trained classifier, and pass it into CalibratedClassifierCV, and what we’re doing here is just saying: hey, we’ve already pre-fit this classifier clf, so all we’re going to do is apply an isotonic regression on it to calibrate the model. And how we calibrate it is using the validation data, which is another set of 1,000 examples.
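To give a feel for what an isotonic calibrator does with the validation data, here is a minimal sketch of the pool-adjacent-violators idea in plain Python. It is not scikit-learn’s implementation: `fit_isotonic` and `predict_isotonic` are hypothetical names, tied scores get no special treatment, and prediction is a simple step-function lookup, whereas library implementations may interpolate between points.

```python
def fit_isotonic(scores, labels):
    """Pool adjacent violators: return (upper_score, calibrated_value) blocks, non-decreasing."""
    blocks = []  # each block is [sum_of_labels, count, highest_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the current block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def predict_isotonic(model, score):
    """Return the value of the first block whose upper score bound covers `score`."""
    for top, value in model:
        if score <= top:
            return value
    return model[-1][1]  # scores above the training range get the last block's value


# Toy uncalibrated scores and true labels, standing in for validation-set predictions.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 1, 1, 1]

model = fit_isotonic(scores, labels)
print(model)
print(predict_isotonic(model, 0.25))
```

The scores 0.2 and 0.3 violate monotonicity (a positive followed by a negative), so they are pooled into one block whose calibrated value is the average of their labels.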
And so we’re going to make the predictions right here with predict_proba, and you can see that the distribution of the predictions of the calibrated model is pretty similar to what we saw previously, right up here. Before, 0.47 was the median and now it’s 0.5, which honestly isn’t much of a difference. The AUC is similar too, and the Brier score is very comparable: 0.092, which I think is the same as it was previously. So basically calibration doesn’t really do too much here, and we still get a curve that’s very similar to y = x.

By the way, I think I have to correct myself real quick here. I think I mentioned that this curve was created from only the thousand samples of the validation set. That’s wrong. It was actually computed from the thousand samples of the test set; I’m not sure if I made that clear. Oh well, now you know: this is computed from the test set, and this one is also computed from the test set, because we calibrate the model and then make predictions on the test set. We’re calibrating using the validation set, though, but we’re making this plot with the test set. I think I’ve repeated myself three times there, but that’s okay as long as we all understand. So, yeah, basically calibration didn’t really do much here, because it’s a well-balanced data set and logistic regression is pretty good at returning probabilities.

Now, this is kind of an extra thing: a not-so-recommended approach I’ve seen, which is basically using your training set to also calibrate your model. It’s probably not the best approach, because you’re training and calibrating on the same data, and this may lead to certain biases. But I’ve seen it in certain tutorials out there, so I’m throwing it out here anyways.
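The recommended flow described above (train on the train set, calibrate a pre-fit classifier on the validation set, evaluate on the test set) might look roughly like this with scikit-learn. This is a sketch, not the notebook’s actual code: the random seeds, `max_iter`, stratified splitting, and the exact `make_classification` arguments are assumptions. Newer scikit-learn releases wrap a pre-fit model in `FrozenEstimator`, while older ones use `cv="prefit"`, so the sketch tries both.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Dummy balanced data: 10,000 samples, 10 features, all informative (assumed settings).
X, y = make_classification(
    n_samples=10_000, n_features=10, n_informative=10, n_redundant=0, random_state=0
)

# 80/10/10 split into train, validation, and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Uncalibrated model, fit on the train set only.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_uncal = clf.predict_proba(X_test)[:, 1]

# Calibrate the already-fitted model on the validation set with isotonic regression.
try:  # scikit-learn >= 1.6
    from sklearn.frozen import FrozenEstimator
    calibrated = CalibratedClassifierCV(FrozenEstimator(clf), method="isotonic")
except ImportError:  # older releases
    calibrated = CalibratedClassifierCV(clf, method="isotonic", cv="prefit")
calibrated.fit(X_val, y_val)
p_cal = calibrated.predict_proba(X_test)[:, 1]

# Both Brier scores are computed on the held-out test set.
brier_uncal = brier_score_loss(y_test, p_uncal)
brier_cal = brier_score_loss(y_test, p_cal)
print(brier_uncal, brier_cal)
```

Passing `method="sigmoid"` instead of `"isotonic"` selects Platt scaling.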
Still, that approach is good to at least be aware of. All right, now moving on to the unbalanced data set case. These are cases like fraud data, where you might have only a few cases of fraud but an abundance of normal transactions. In this case I’m also creating 10,000 examples with 10 features, all of them significant, and of these 10,000, 1,000 are positive and the other 9,000 are negative samples. Right here I’m doing a split into train, test, and validation sets again, 80/10/10, and you can see we still have a one-to-nine ratio, which agrees with the weights that we’ve given. So it’s kind of representative of what we would see with fraud data.

Like we did before for the balanced data set case, let’s look at what happens if we pass this into an uncalibrated model. We’ll basically pass it into a logistic regression, and what we’re doing here is passing in a parameter: class_weight equal to "balanced". What this does is that, because it’s an unbalanced data set, the positive examples are going to be weighted about nine times more than the negative examples. This is done so that the model is better able to pick up on these positive examples, and it’s also kind of a requirement. Once we’re done there, we fit the model and make predictions. We see that the AUC is about 90 percent, pretty good, and the Brier score is 0.087. Okay, that’s fine. Now, when we describe the predictions of just this uncalibrated model, we can see that 50 percent of them are under 10 percent; the median prediction is about 10 percent. This is just something to keep in mind, because we’ll be comparing it later to the calibrated model case, and you’ll see the difference in probabilities.
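The "about nine times" figure mentioned above follows from the "balanced" heuristic that scikit-learn documents for `class_weight`: n_samples / (n_classes * count of each class). Here is a small sketch of that arithmetic, assuming the train set holds exactly 7,200 negatives and 800 positives (the transcript only gives the one-to-nine ratio) and using a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c]), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Assumed train-set label counts with a one-to-nine positive-to-negative ratio.
train_labels = [0] * 7200 + [1] * 800

weights = balanced_class_weights(train_labels)
ratio = weights[1] / weights[0]  # how much more a positive example counts
print(weights, ratio)
```

Because the weights are inversely proportional to class frequency, a one-to-nine class ratio turns into a nine-to-one weight ratio, which is also why the uncalibrated probabilities end up skewed upward for the positive class.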
So now, if we create the calibration curve on the test set, you can see that it deviates a lot from y = x, from that straight diagonal line. So when I look at this plot, I see, okay, the model is really not that calibrated, which means that the probability values being returned in y_pred_df are actually not very representative of probabilities. So now it becomes pretty apparent. What do we do here? Let’s try to calibrate the model. We have our classifier again, and we pass it to our calibrator, CalibratedClassifierCV. We calibrate the model with the validation data set, and then we make predictions. Now we have an AUC that’s not too different from before, but look at our Brier score: it’s now 0.05, which is definitely better than the 0.08 that we saw previously. And when you look at the predictions right over here, the predictions returned by our calibrated model have a median of about 1.5 percent; before, the median was 10 percent. So you can see that the probability values have decreased a lot compared to what they were in the uncalibrated case. This is what I hinted at back in the explanation before I showed all this code. These should be more representative of true probabilities. Why is that the case? Well, if we look at this calibration curve right now, you can see it’s much closer to y = x, so these values are actually more representative of true probabilities. And this is just the same case I mentioned before, where training and calibration happen using just one set of data at the same time, in one shebang. So yeah, that’s kind of all about model calibration, and an interesting place where you would use
this is anywhere you really need the probability values to be representative of actual probabilities, like in expectation problems, when you’re finding the expected value of, perhaps, one of your features. I’ve actually illustrated this in a lot of detail in another video on expectations; I think it was a video that came out before this one. So if you want to check out a cool implementation using probabilities and calibration, I suggest you check that video out. Other than that, I have some references down in the description below, or rather, actually, right here at the end of this notebook, and this code will be available on GitHub; the link is also in the description below. So yeah, please comment, like, subscribe, do everything you need to do to get the word out. I’m trying to grow a good channel here, so stay tuned, stay safe, and I’ll see you later. Bye bye!