Transcript:
Wait, soon you will see: ROC and AUC, they are cool. Yeah, StatQuest!

Hello, I’m Joshua Starmer, and welcome to StatQuest. Today we’ll talk about ROC and AUC and explain them in an accessible way.

Note: This StatQuest builds on the confusion matrix and the sensitivity and specificity videos, so if you haven’t watched them, check them out! Also, the example I give in this video is based on logistic regression, so while ROC and AUC are not only applied to logistic regression, make sure you understand the basics.

Let’s start with the data. The y-axis displays two categories: obese and not obese. Blue dots indicate obese mice. Red dots show mice that are not obese. The x-axis is the weight. This mouse is not obese, although it has a large mass. It is probably a Supermouse with a mountain of muscles. This mouse doesn’t weigh that much, but it is still considered obese for its size.

Now let’s fit a logistic regression curve to the data! When we apply logistic regression, the y-axis changes to show the probability that a mouse is obese. Now let’s just look at the curve. If someone told us they had a heavy mouse that weighed this much, the curve would tell us that the probability that the mouse is obese is high. If someone told us they had a light mouse that weighed this much, the curve would tell us that the probability that the mouse is obese is low. So this logistic regression tells us the probability that a mouse is obese based on its weight.

However, if we want to classify mice as obese or not obese, we need a way to turn probabilities into classifications. One way to classify mice is to set the threshold to 0.5, and count all mice with a probability of obesity greater than 0.5 as obese and all mice with a probability of obesity less than or equal to 0.5 as not obese. Using 0.5 as the cut-off, we would call this mouse obese.
And this one is not obese. If another mouse weighed this much, we would classify it as obese. And if another weighed this much, we would classify it as not obese.

To evaluate the performance of this logistic regression with the threshold set to 0.5, we can test it on mice that we know are obese or not obese. Here are the weights of four new mice that we know are not obese. And here are the weights of four new mice that we know are obese.

We know this mouse is not obese, and the logistic regression with the threshold set to 0.5 correctly classifies it as not obese. This mouse is also correctly classified. And this mouse is misclassified: we know it is obese, but it is classified as not obese. The next mouse is correctly classified. But this mouse is misclassified. The last three mice are classified correctly.

Now we create a confusion matrix to summarize the classification results. These three samples were correctly classified as obese. And this sample was predicted to be obese, although it was not. These three samples were correctly classified as not obese. And this sample was predicted to be not obese, although it was obese. Once the confusion matrix is filled in, we can calculate sensitivity and specificity to evaluate this logistic regression when the threshold for obesity is set to 0.5.

Little bam, because so far this is all review.

Now let’s talk about what happens when we use a different threshold for deciding if a sample is obese. For example, if it is very important to correctly classify every obese sample, we can set the threshold to 0.1. This will correctly classify all 4 obese mice. But it also increases the number of false positives.
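The thresholding and confusion matrix steps described above can be sketched in a few lines of Python. The probabilities below are made-up values (the video does not state the actual numbers), chosen only so that the counts at 0.5 match the confusion matrix described in the transcript; the function name `confusion_counts` is likewise just an illustrative choice.

```python
# Hypothetical predicted probabilities of obesity for eight test mice,
# ordered from lightest to heaviest. Illustrative values only.
probs = [0.02, 0.06, 0.20, 0.35, 0.70, 0.92, 0.96, 0.99]
# Known labels for the same mice: 1 = obese, 0 = not obese.
labels = [0, 0, 0, 1, 0, 1, 1, 1]


def confusion_counts(probs, labels, threshold):
    """Return (TP, FP, TN, FN), calling a mouse obese when its probability > threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted_obese = p > threshold
        if predicted_obese and y == 1:
            tp += 1
        elif predicted_obese and y == 0:
            fp += 1
        elif not predicted_obese and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


for threshold in (0.5, 0.1):
    tp, fp, tn, fn = confusion_counts(probs, labels, threshold)
    sensitivity = tp / (tp + fn)  # proportion of obese mice caught
    specificity = tn / (tn + fp)  # proportion of non-obese mice correctly cleared
    print(threshold, (tp, fp, tn, fn), sensitivity, specificity)
```

With these assumed values, the 0.5 threshold gives 3 true positives, 1 false positive, 3 true negatives and 1 false negative, while the 0.1 threshold catches all 4 obese mice at the cost of one more false positive.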
The lower threshold also reduces the number of false negatives, because all of the obese mice are correctly classified.

Note: If the idea of using a threshold other than 0.5 strikes you as odd, imagine that instead of classifying samples as obese or not obese, we were classifying samples as infected with Ebola or not infected with Ebola. In this case, it is absolutely essential to correctly classify every sample infected with Ebola, to minimize the risk of an outbreak. And that means lowering the threshold, even if it results in more false positives.

Alternatively, we could set the threshold to 0.9. In this case, we would correctly classify the same number of obese samples as with the 0.5 threshold. But we wouldn’t have any false positives. And we would correctly classify one more sample as not obese. And we would have the same number of false negatives as before. With this data, the higher threshold does a better job of classifying samples as obese or not obese.

But the threshold can be set to anything between 0 and 1. How do we determine which threshold is the best? For starters, we don’t need to check every possible threshold.
For example, these thresholds all give the same confusion matrix. But even if we made one confusion matrix for each threshold that matters, we would end up with a confusingly large number of confusion matrices. So, instead of being overwhelmed by confusion matrices, we can draw a ROC (Receiver Operating Characteristic) graph, which summarizes all of the information graphically.

The y-axis shows the true positive rate, which is the same thing as sensitivity. The true positive rate is the number of true positives divided by the sum of the true positives and the false negatives. In this example, true positives are samples that are correctly classified as obese. And false negatives are obese samples that are incorrectly classified as not obese. The true positive rate tells us what proportion of obese samples were correctly classified.

The x-axis shows the false positive rate, which is the same thing as 1 minus specificity. The false positive rate is the number of false positives divided by the sum of the false positives and the true negatives. False positives are samples that are not obese but were incorrectly classified as obese. True negatives are samples that are correctly classified as not obese. The false positive rate tells us what proportion of samples that are not obese were incorrectly classified.

To get a better sense of how the ROC graph works, let’s draw one from start to finish using our example data. Let’s start with a threshold that classifies all of the samples as obese. In this case, the confusion matrix is as follows. First, let’s calculate the true positive rate. There are 4 true positives and 0 false negatives, so the calculation gives us 4 / (4 + 0) = 1. The true positive rate when the threshold is so low that every sample is classified as obese is 1. This means that every obese sample was correctly classified.
Now let’s calculate the false positive rate. There were 4 false positives in the confusion matrix and 0 true negatives, so the calculation gives us 4 / (4 + 0) = 1. The false positive rate when the threshold is so low that every sample is classified as obese is also 1. This means that every sample that is not obese was incorrectly classified as obese. Now let’s plot a point at (1, 1). The point at (1, 1) means that even though we correctly classified all of the obese samples, we also incorrectly classified all of the samples that are not obese.

This green diagonal line shows where the true positive rate equals the false positive rate. Any point on this line means that the proportion of correctly classified obese samples is the same as the proportion of incorrectly classified samples that are not obese.

Coming back to the logistic regression, let’s increase the threshold so that all but the lightest sample are classified as obese. The new threshold gives us the following confusion matrix. We calculate the true positive rate and the false positive rate and plot a point at (0.75, 1). Because the new point at (0.75, 1) is to the left of the green line, we know that the proportion of correctly classified obese samples (true positives) is greater than the proportion of samples incorrectly classified as obese (false positives). In other words, the new threshold for deciding whether a sample is obese is better than the first.

Now let’s increase the threshold so that all but the two lightest samples are classified as obese. The new threshold gives us the following confusion matrix. Then we calculate the true positive rate and the false positive rate and plot a point at (0.5, 1). The new point is even further to the left of the green line, which means that the new threshold further decreases the proportion of samples incorrectly classified as obese (false positives). In other words, the new threshold is the best one so far.
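The first three points on the graph can be checked with a small Python sketch. The confusion matrix counts below are inferred from the description above (4 obese and 4 non-obese samples, with the lightest samples being non-obese); the helper name `roc_point` is an illustrative choice, not something from the video.

```python
def roc_point(tp, fp, tn, fn):
    """Return (false positive rate, true positive rate) for one confusion matrix."""
    tpr = tp / (tp + fn)  # y-axis: sensitivity
    fpr = fp / (fp + tn)  # x-axis: 1 - specificity
    return fpr, tpr


# (TP, FP, TN, FN) for the three thresholds walked through so far.
everything_obese = (4, 4, 0, 0)
all_but_lightest = (4, 3, 1, 0)
all_but_two_lightest = (4, 2, 2, 0)

for counts in (everything_obese, all_but_lightest, all_but_two_lightest):
    print(counts, roc_point(*counts))
```

Each step moves the point to the left while the true positive rate stays at 1, which is why each new threshold is judged better than the previous one.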
Now let’s increase the threshold again, create a confusion matrix, calculate the true positive rate and the false positive rate, and plot a point. We increase the threshold again, create a confusion matrix, calculate the true positive rate and the false positive rate, and plot a point on the graph. The threshold at the new point (0, 0.75) correctly classifies 75% of the obese samples and 100% of the samples that are not obese. In other words, this threshold results in no false positives. Now we increase the threshold again and plot a point. And we increase the threshold again and plot a point. Finally, we choose a threshold that classifies all of the samples as not obese, and plot a point. The point at (0, 0) represents a threshold that results in zero true positives and zero false positives.

If we want, we can connect the dots, and that gives us a ROC graph. The ROC graph summarizes all of the confusion matrices that each threshold produced. Without having to sort through the confusion matrices, I can tell that this threshold is better than this one. And depending on how many false positives I am willing to accept, the optimal threshold is either this one or that one. Bam!

Now that we know what a ROC graph is, let’s talk about the area under the curve (AUC). The AUC is 0.9. Bam? The AUC makes it easy to compare one ROC curve to another. The AUC for the red ROC curve is greater than the AUC for the blue ROC curve, suggesting that the red curve is better. So, if the red curve represented logistic regression and the blue curve represented a random forest, you would use the logistic regression. Double bam!

Now, one last thing before we’re done. Although ROC graphs are drawn using true positive rates and false positive rates to summarize confusion matrices, there are other metrics that attempt to do the same thing. For example, people often replace the false positive rate with precision. Precision is the number of true positives divided by the sum of the true positives and the false positives.
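The definition of precision just given can be written out as a short Python sketch. The counts are hypothetical: 3 true positives and 1 false positive match the 0.5 threshold example from earlier in the transcript, while the larger true negative count is invented purely to show that precision does not use true negatives, whereas the false positive rate does.

```python
def precision(tp, fp):
    """Proportion of samples predicted positive that really are positive."""
    return tp / (tp + fp)


def false_positive_rate(fp, tn):
    """Proportion of truly negative samples that were predicted positive."""
    return fp / (fp + tn)


tp, fp = 3, 1
# Same predictions, but with few or many true negatives (hypothetical counts).
for tn in (3, 300):
    print(tn, precision(tp, fp), false_positive_rate(fp, tn))
```

Precision stays at 0.75 in both cases, while the false positive rate shrinks as the number of true negatives grows.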
Precision is the proportion of positive results that were correctly classified. If there were many samples that were not obese relative to the number of obese samples, then precision might be more useful than the false positive rate. This is because precision does not include the number of true negatives in its calculation, and is not affected by the imbalance. In practice, this kind of imbalance occurs when studying a rare disease. In this case, the study will contain many more people without the disease than with the disease. Bam!

In summary, ROC curves make it easy to identify the best threshold for making a decision. This threshold is better than this one. And the AUC can help you decide which categorization method is better. Red is better than blue.

Hooray, we’ve made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider buying a T-shirt, a hoodie, or one or two of my original songs; the link is below. Until next time, quest on!
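As a recap of the ROC and AUC ideas summarized above, here is a minimal Python sketch that sweeps thresholds to collect ROC points and then estimates the area with the trapezoidal rule. The probabilities and labels are made-up illustrative values, not the data from the video, and `roc_points` and `auc` are hypothetical helper names.

```python
# Hypothetical predicted probabilities (lightest to heaviest mouse)
# and known labels: 1 = obese, 0 = not obese. Illustrative values only.
probs = [0.02, 0.06, 0.20, 0.35, 0.70, 0.92, 0.96, 0.99]
labels = [0, 0, 0, 1, 0, 1, 1, 1]


def roc_points(probs, labels):
    """Return (FPR, TPR) points, from the lowest threshold to the highest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # One threshold below every score (everything is called obese),
    # then one threshold at each distinct score.
    thresholds = [float("-inf")] + sorted(set(probs))
    points = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, labels) if p > t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p > t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC points using the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


points = roc_points(probs, labels)
print(points)
print(auc(points))
```

For these assumed values the curve runs from (1, 1) down to (0, 0) and passes through (0.75, 1), (0.5, 1) and (0, 0.75), and the area comes out to 0.9375. That figure belongs to the made-up data here and is not meant to reproduce the 0.9 quoted in the video.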