Transcript:
StatQuest, gettin' freaky. StatQuest, kinda sneaky. StatQuest!

Hello, I'm Josh Starmer, and welcome to StatQuest. Today, at long last, we're going to cover logistic regression in R. Note: a link to the code, which is chock full of comments and should be easy to follow, is in the description below.

For this example, we're going to get a real data set from the UCI Machine Learning Repository. Specifically, we want the heart disease data set. Note: this is the same data set we used when we made random forests in R, so if you're familiar with that data, you can skip ahead to about 3 minutes and 44 seconds in this video.

We start by making a variable called "url" and set it to the location of the data we want. This is how we read the data set into R from the URL. The head() function shows us the first six rows of data. Unfortunately, none of the columns are labeled. Wah wah. So we name the columns after the names that were listed on the UCI website. Hooray! Now when we look at the first six rows with the head() function, things look a lot better. However, the str() function, which describes the structure of the data, tells us that some of the columns are messed up. Right now sex is a number, but it's supposed to be a factor, where 0 represents "female" and 1 represents "male". cp, aka chest pain, is also supposed to be a factor, where levels 1 through 3 represent different types of pain and 4 represents no chest pain. ca and thal are correctly called factors, but one of the levels is "?" when we need it to be NA. So we've got some cleaning up to do.

The first thing we do is change the question marks to NAs. Then, just to make the data easier on the eyes, we convert the 0s in sex to "F", for female, and the 1s to "M", for male. Lastly, we convert the column into a factor. Then we convert a bunch of other columns into factors, since that's what they're supposed to be. See the UCI website or the sample code on the StatQuest blog for more details. Since the ca column originally had a question mark in it, rather than NA, R thinks it's a column of strings. We correct that assumption by telling R that it's a column of integers, and then we convert it to a factor. Then we do the same thing for thal. The last thing we need to do to the data is make hd, aka heart disease, a factor that is easy on the eyes. Here I'm using a fancy trick with the ifelse() function to convert the 0s to "Healthy" and the 1s to "Unhealthy". We could have done a similar trick for sex, but I wanted to show you both ways to convert numbers to words.

Once we're done fixing up the data, we check that we have made the appropriate changes with the str() function. Hooray, it worked! Now we see how many samples (rows of data) have NA values. Later we will decide if we can just toss these samples out, or if we should impute values for the NAs. Six samples (rows of data) have NAs in them. We can view the samples with NAs by selecting those rows from the data frame, and there they are. Five of the six samples are male, and two of the six have heart disease. If we wanted to, we could impute values for the NAs using a random forest or some other method; however, for this example we'll just remove these samples. Including the six samples with NAs, there are 303 samples. Then we remove the six samples that have NAs, and after removing those samples, there are 297 samples remaining. Small bam.
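For reference, here is a minimal sketch of the loading and cleaning steps just described. It is not the exact script from the video (that is linked in the description); the URL and column names are taken from the UCI page, and it assumes a recent version of R that reads strings as characters.

```r
## Read the raw heart disease data from the UCI Machine Learning Repository.
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header = FALSE)

## The raw file has no column names, so add the ones listed on the UCI site
## (the last column, the diagnosis, is renamed "hd" for heart disease).
colnames(data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "hd")

str(data)  # sex, cp, ca, thal, and hd are not yet the factors they should be

data[data == "?"] <- NA  # "?" marks missing values in this file

## Convert sex to an easy-to-read factor: 0 = female, 1 = male.
data$sex <- factor(ifelse(data$sex == 0, "F", "M"))

## Convert the other categorical columns to factors.
data$cp      <- as.factor(data$cp)
data$fbs     <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang   <- as.factor(data$exang)
data$slope   <- as.factor(data$slope)

## ca and thal were read as strings because of the "?"s, so tell R they
## are columns of integers first, then convert them to factors.
data$ca   <- as.factor(as.integer(data$ca))
data$thal <- as.factor(as.integer(data$thal))

## Make heart disease a factor that is easy on the eyes, using ifelse().
data$hd <- factor(ifelse(data$hd == 0, "Healthy", "Unhealthy"))

## Count and then remove the six rows that contain NAs.
nrow(data[is.na(data$ca) | is.na(data$thal), ])       # 6 samples have NAs
data <- data[!(is.na(data$ca) | is.na(data$thal)), ]
nrow(data)                                            # 297 samples remain
```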
Now we need to make sure that healthy and diseased samples come from each gender, female and male. If only male samples have heart disease, we should probably remove all females from the model. We do this with the xtabs() function. We pass xtabs() the data and use model syntax to select the columns in the data we want to build a table from. In this case, we want a table with heart disease and sex, and, bam, healthy and unhealthy patients are both represented by a lot of female and male samples. Now let's verify that all four levels of chest pain (cp for short) were reported by a bunch of patients. Yes. And then we do the same thing for all of the boolean and categorical variables that we are using to predict heart disease. Here's something that could cause trouble: for the resting electrocardiographic results, only four patients represent level 1. This could potentially get in the way of finding the best fitting line; however, for now we'll just leave it in and see what happens. And then we just keep looking at the remaining variables to make sure that they're all represented by a reasonable number of patients.

Okay, we've done all the boring stuff. Now let's do logistic regression. Let's start with a super simple model: we'll try to predict heart disease using only the gender of each patient. Here's our call to the glm() function, the function that performs generalized linear models. First, we use formula syntax to specify that we want to use sex to predict heart disease. Then we specify the data that we are using for the model. Lastly, we specify that we want the binomial family of generalized linear models. This makes the glm() function do logistic regression, as opposed to some other type of generalized linear model. Oh, I almost forgot to mention that we are storing the output from the glm() function in a variable called "logistic". We then use the summary() function to get details about the logistic regression. Bam!

The first line has the original call to the glm() function. Then it gives you a summary of the deviance residuals. They look good, since they are close to being centered on zero and are roughly symmetrical. If you want to know more about deviance residuals, check out the StatQuest "Deviance Residuals, Clearly Explained". Then we have the coefficients. They correspond to the following model:

heart disease = -1.0438 + 1.2737 × (the patient is male)

The variable "the patient is male" is equal to 0 when the patient is female and 1 when the patient is male. Thus, if we are predicting heart disease for a female patient, we get the following equation:

heart disease = -1.0438 + 1.2737 × 0

This reduces to heart disease = -1.0438. Thus, the log(odds) that a female has heart disease is -1.0438. If we are predicting heart disease for a male patient, we get the following equation:

heart disease = -1.0438 + 1.2737 × 1

and that reduces to heart disease = -1.0438 + 1.2737. Since the first term is the log(odds) of a female having heart disease, the second term indicates the increase in the log(odds) that a male has of having heart disease. In other words, the second term is the log of the odds ratio: the odds that a male will have heart disease over the odds that a female will have heart disease.
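Here is a compact sketch of those checks and the super simple model, assuming the cleaned data frame from the sketch above; the coefficient values in the comments are the ones read out in the video.

```r
## Cross-tabulate heart disease against each categorical predictor to make
## sure every level is represented by both healthy and unhealthy patients.
xtabs(~ hd + sex, data = data)
xtabs(~ hd + cp, data = data)
xtabs(~ hd + restecg, data = data)  # level 1 is represented by only 4 patients

## The super simple model: predict heart disease using only sex.
## family = "binomial" tells glm() to do logistic regression.
logistic <- glm(hd ~ sex, data = data, family = "binomial")
summary(logistic)

## From the summary: heart disease = -1.0438 + 1.2737 * (the patient is male),
## so -1.0438 is the log(odds) that a female has heart disease, and 1.2737 is
## the log(odds ratio) for males vs. females. A log(odds) x converts to a
## probability via exp(x) / (1 + exp(x)):
exp(-1.0438) / (1 + exp(-1.0438))  # ~0.26 probability for a female patient
```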
This part of the logistic regression output shows how the Wald test was computed for both coefficients, and here are the p-values. Both p-values are well below 0.05, and thus the log(odds) and the log(odds ratio) are both statistically significant. But remember, a small p-value alone isn't interesting; we also want large effect sizes, and that's what the log(odds) and the log(odds ratio) tell us. If you want to know more details on the coefficients and the Wald test, check out the following StatQuests: "Odds and Log(Odds), Clearly Explained", "Odds Ratios and Log(Odds Ratios), Clearly Explained", and "Logistic Regression Details, Part 1: Coefficients".

Next we see the default dispersion parameter used for this logistic regression. When we do normal linear regression, we estimate both the mean and the variance from the data. In contrast, with logistic regression we estimate the mean of the data, and the variance is derived from the mean. Since we are not estimating the variance from the data, and instead just deriving it from the mean, it is possible that the variance is underestimated. If so, you can adjust the dispersion parameter in the summary() command.

Then we have the null deviance and the residual deviance. These can be used to compare models, compute R², and compute an overall p-value. For more details, check out the StatQuests "Logistic Regression Details, Part 3: R-squared and its p-value" and "Saturated Models and Deviance Statistics, Clearly Explained".

Then we have the AIC, the Akaike information criterion, which in this context is just the residual deviance adjusted for the number of parameters in the model. The AIC can be used to compare one model to another. Lastly, we have the number of Fisher scoring iterations, which just tells us how quickly the glm() function converged on the maximum likelihood estimates for the coefficients. If you want more details on how the coefficients were estimated, check out the StatQuest "Logistic Regression Details, Part 2: Fitting a Line with Maximum Likelihood". Double bam!

Now that we've done a simple logistic regression, using just one of the variables, sex, to predict heart disease, we can create a fancy model that uses all of the variables to predict heart disease. This formula syntax, hd ~ ., means that we want to model heart disease (hd) using all of the remaining variables in our data frame called "data". We can then see what our model looks like with the summary() function. Dang, the summary goes off the screen. No worries, we'll just talk about a few of the coefficients. We see that age isn't a useful predictor, because it has a large p-value. However, the median age in our data set was 56, so most of the folks were pretty old, and that explains why it wasn't very useful. Gender is still a good predictor, though. If we scroll down to the bottom of the output, we see that the residual deviance and the AIC are both much smaller for this fancy model than they were for the simple model, when we only used gender to predict heart disease.
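The fancy model is a one-line change, assuming the same cleaned data frame:

```r
## The fancy model: "hd ~ ." means model hd using all remaining columns.
logistic <- glm(hd ~ ., data = data, family = "binomial")
summary(logistic)  # residual deviance and AIC are much smaller than before
```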
If we want to calculate McFadden's pseudo R², we can pull the log-likelihood of the null model out of the "logistic" variable by getting the value for the null deviance and dividing by -2, and we can pull the log-likelihood for the fancy model out of the "logistic" variable by getting the value for the residual deviance and dividing by -2. Then we just do the math, and we end up with a pseudo R² = 0.55. This can be interpreted as the overall effect size. And we can use those same log-likelihoods to calculate a p-value for that R², using a chi-squared distribution. In this case the p-value is tiny, so the R² value isn't due to dumb luck. One last shameless self-promotion: more details on the R² and p-value can be found in the following StatQuest, "Logistic Regression Details, Part 3: R-squared and its p-value".

Lastly, we can draw a graph that shows the predicted probabilities that each patient has heart disease, along with their actual heart disease status. I'll show you the code in a bit (a sketch also follows this transcript). Most of the patients with heart disease (the ones in turquoise) are predicted to have a high probability of having heart disease, and most of the patients without heart disease (the ones in salmon) are predicted to have a low probability of having heart disease. Thus, our logistic regression has done a pretty good job. However, we could use cross-validation to get a better idea of how well it might perform with new data, but we'll save that for another day.

To draw the graph, we start by creating a new data frame that contains the probabilities of having heart disease along with the actual heart disease status. Then we sort the data frame from low probabilities to high probabilities. Then we add a new column to the data frame that has the rank of each sample, from low probability to high probability. Then we load the ggplot2 library so we can draw a fancy graph. Then we load the cowplot library so that ggplot has nice-looking defaults. Then we call ggplot() and use geom_point() to draw the data. And lastly, we call ggsave() to save the graph as a PDF file. Triple bam!!!

Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more of them, please subscribe. And if you want to support StatQuest, well, please click the like button below and consider buying one or two of my original songs. All right, until next time, quest on!
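Finally, here is a sketch of the pseudo R², its p-value, and the graph described above, assuming the fancy model is stored in "logistic". The point styling and the output file name are illustrative guesses, not necessarily what the video's script uses; the default ggplot colors for a two-level factor happen to be salmon and turquoise, matching the description.

```r
## McFadden's pseudo R-squared: pull the log-likelihoods out of the model
## by dividing the deviances by -2.
ll.null     <- logistic$null.deviance / -2  # log-likelihood of the null model
ll.proposed <- logistic$deviance / -2       # log-likelihood of the fancy model
(ll.null - ll.proposed) / ll.null           # pseudo R-squared, ~0.55

## A p-value for that R-squared, using a chi-squared distribution with
## degrees of freedom equal to the number of predictor coefficients.
1 - pchisq(2 * (ll.proposed - ll.null), df = length(logistic$coefficients) - 1)

## Graph the predicted probability of heart disease for each patient,
## colored by their actual heart disease status.
library(ggplot2)
library(cowplot)  # older versions of cowplot set a clean default ggplot theme

predicted.data <- data.frame(probability.of.hd = logistic$fitted.values,
                             hd = data$hd)
## Sort from low to high probability, then rank each sample.
predicted.data <- predicted.data[order(predicted.data$probability.of.hd), ]
predicted.data$rank <- 1:nrow(predicted.data)

ggplot(predicted.data, aes(x = rank, y = probability.of.hd, color = hd)) +
  geom_point(alpha = 1, shape = 4, stroke = 2) +
  xlab("Index") +
  ylab("Predicted probability of getting heart disease")

ggsave("heart_disease_probabilities.pdf")  # hypothetical file name
```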