Transcript:
You don't need a ukulele to do statistics, but it makes it more fun. Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about how to build, use, and evaluate random forests in R. This StatQuest builds on two StatQuests that I've already created that demonstrate the theory behind random forests, so if you're not familiar with that, check them out. Just so you know, you can download all the code that I describe in this tutorial using the link in the description below.

The first thing we do is load ggplot2 so we can draw fancy graphs. When I do that, R prints out a little message; it's no big deal. Then we load cowplot, which just improves some of ggplot2's default settings. It overwrites the ggsave() function, and that's fine with me, so no worries here. The last library we need to load is randomForest (duh), so we can make random forests. It also prints out some stuff in red, but it's no big deal. We can move on from here.

For this example, we're going to get a real data set from the UCI Machine Learning Repository. Specifically, we want the heart disease data set. So we make a variable called url and set it to the location of the data we want for our random forest, and this is how we read the data set into R from the URL. The head() function shows us the first six rows of data. Unfortunately, none of the columns are labeled (wah-wah), so we name the columns after the names that were listed on the UCI website. The UCI website actually lists a whole lot of information about this data, so it's worth checking out if you haven't done that already. Hooray! Now when we look at the first six rows with the head() function, things look a lot better.

However, the str() function, which describes the structure of the data, tells us that some of the columns are messed up. sex is supposed to be a factor, where 0 represents female and 1 represents male. cp (aka chest pain) is also supposed to be a factor, where levels 1 through 3 represent different types of pain and 4 represents no chest pain. ca and thal are correctly called factors, but one of the levels is "?" when we need it to be NA. So we've got some cleaning up to do.

The first thing we do is change the question marks to NAs. Then, just to make the data easier on the eyes, we convert the 0s in sex to "F" for female and the 1s to "M" for male. Lastly, we convert the column into a factor. Then we convert a bunch of other columns into factors, since that's what they're supposed to be (see the UCI website or the sample code on the StatQuest blog for more details). Since the ca column originally had a question mark in it rather than NA, R thinks it's a column of strings. We correct that assumption by telling R it's a column of integers, and then we convert it to a factor. Then we do the exact same thing for thal. The last thing we need to do to the data is make hd (aka heart disease) a factor that is easy on the eyes. Here I'm using a fancy trick with ifelse() to convert the 0s to "Healthy" and everything else to "Unhealthy". We could have done a similar trick for sex, but I wanted to show you both ways to convert numbers to words.

Once we're done fixing up the data, we can check that we've made the appropriate changes with the str() function. sex is now a factor with levels F and M, and everything else looks good too. Hooray, we're done with the boring part! Now we can have some fun.
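For reference, here is a minimal sketch of the loading and cleanup steps just described. The URL below is assumed to be the processed Cleveland file from the UCI heart disease page, and the column names follow the UCI documentation; the downloadable code linked in the description is the authoritative version and may differ in small ways.

```r
library(ggplot2)       # fancy graphs
library(cowplot)       # nicer ggplot2 defaults
library(randomForest)  # random forests (duh)

## Assumed location of the processed Cleveland heart disease data on the UCI site.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header = FALSE)
head(data)   # no column names yet (wah-wah)

## Name the columns after the names listed on the UCI website.
colnames(data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "hd")
str(data)    # some columns are messed up

## Clean up: "?" becomes NA, and the categorical columns become factors.
data[data == "?"] <- NA
data[data$sex == 0,]$sex <- "F"    # easier on the eyes than 0/1
data[data$sex == 1,]$sex <- "M"
data$sex     <- as.factor(data$sex)
data$cp      <- as.factor(data$cp)
data$fbs     <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang   <- as.factor(data$exang)
data$slope   <- as.factor(data$slope)

## ca and thal had "?" in them, so R read them as strings;
## tell R they are integers, then convert them to factors.
data$ca   <- as.factor(as.integer(data$ca))
data$thal <- as.factor(as.integer(data$thal))

## Make hd (heart disease) easy on the eyes with the ifelse() trick.
data$hd <- as.factor(ifelse(data$hd == 0, "Healthy", "Unhealthy"))

str(data)    # sex is now a factor with levels F and M, etc.
```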
Since we are going to be randomly sampling things, let's set the seed for the random number generator so that we can reproduce our results. Now we impute values for the NAs in the data set with rfImpute(). The first argument to rfImpute() is hd ~ . ("hd tilde dot"), and that means we want the hd (aka heart disease) column to be predicted by the data in all of the other columns. Here's where we specify which data set to use; in this case, there's only one data set, and it's called data. Here's where we specify how many random forests rfImpute() should build to estimate the missing values. In theory, four to six iterations is enough; just for fun, I set this parameter, iter, equal to 20, but it didn't improve the estimates. Lastly, we save the result, the data set with imputed values instead of NAs, as data.imputed. After each iteration, rfImpute() prints out the out-of-bag (OOB) error rate. This should get smaller if the estimates are improving; since it doesn't, we can conclude that our estimates are as good as they're going to get with this method.

Here's where we actually build a proper random forest using the randomForest() function. Just like when we imputed values for the NAs, we want to predict hd (aka heart disease) using all of the other columns in the data set; however, this time we specify data.imputed as the data set. We also want randomForest() to return the proximity matrix. We'll use this to cluster the samples at the end of the StatQuest. Lastly, we save the random forest and associated data, like the proximity matrix, as model.

To get a summary of the random forest and how well it performed, we can just type model at the command prompt and hit enter. Here's what gets printed to the screen. The first thing is the original call to randomForest(). Next we see that the random forest was built to classify samples. If we had used the random forest to predict weight or height, it would say "regression", and if we had omitted the thing the random forest was supposed to predict entirely, it would say "unsupervised". Then it tells us how many trees are in the random forest; the default value is 500. Later we will check to see if 500 trees is enough for optimal classification. Then it tells us how many variables (or columns of data) were considered at each internal node. Classification trees have a default setting of the square root of the number of variables; regression trees have a default setting of the number of variables divided by 3. Since we don't know if 3 is the best value, we'll fiddle with this parameter later on.

Here's the out-of-bag (OOB) error estimate; this means that 83.5% of the OOB samples were correctly classified by the random forest. Lastly, we have a confusion matrix. There were 141 healthy patients that were correctly labeled "Healthy" (hooray). There were 27 unhealthy patients that were incorrectly classified as "Healthy" (boo). There were 23 healthy patients that were incorrectly classified as "Unhealthy" (boo). And lastly, there were 112 unhealthy patients that were correctly classified as "Unhealthy" (hooray).
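As a rough sketch of the imputation and model-building calls described above (the seed value here is arbitrary; iter = 20 is the setting mentioned in the video):

```r
set.seed(42)  # arbitrary seed so the random sampling is reproducible

## Impute the NAs: predict hd from all of the other columns, iterating 20 times
## (in theory, 4 to 6 iterations is enough).
data.imputed <- rfImpute(hd ~ ., data = data, iter = 20)

## Build the random forest itself, keeping the proximity matrix so we can
## cluster the samples (draw an MDS plot) at the end.
model <- randomForest(hd ~ ., data = data.imputed, proximity = TRUE)

model  # typing the name prints the call, the OOB error estimate, and the confusion matrix
```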
To see if 500 trees is enough for optimal classification, we can plot the error rates. Here we create a data frame that formats the error-rate information so that ggplot2 will be happy. This is kind of complicated, so let me walk you through it. For the most part, this is all based on a matrix within model called err.rate. This is what the err.rate matrix looks like: there's one column for the out-of-bag error rate, one column for the healthy error rate (i.e., how frequently healthy patients are misclassified), and one column for the unhealthy error rate (i.e., how frequently unhealthy patients are misclassified). Each row reflects the error rates at different stages of creating the random forest: the first row contains the error rates after making the first tree, the second row contains the error rates after making the first two trees, and the last row contains the error rates after making all 500 trees.

So what we're doing here is making a data frame that looks like this: there's one column for the number of trees, one column for the type of error, and one column for the actual error value. And here's the call to ggplot(). Bam! The blue line shows the error rate when classifying unhealthy patients, the green line shows the overall out-of-bag error rate, and the red line shows the error rate when classifying healthy patients. In general, we see the error rates decrease when our random forest has more trees. Would the error rate go down further if we added more trees? To test this hypothesis, we make a random forest with 1,000 trees. The out-of-bag error rate is the same as before, and the confusion matrix shows that we didn't do a better job classifying patients. We can plot the error rates just like before. Double bam! We see that the error rates stabilize right after 500 trees, so adding more trees didn't help, but we would not have known this unless we used more trees.

Now we need to make sure we are considering the optimal number of variables at each internal node in the tree. We start by making an empty vector that can hold ten values, and then we create a loop that tests a different number of variables at each step. Each time we go through the loop, i increases by one; it starts at 1 and ends after 10. In this line, we are building a random forest, using i to determine the number of variables to try at each internal node. Specifically, we are setting mtry = i, with i taking values between 1 and 10. This is where we store the out-of-bag error rate after we build each random forest that uses a different value for mtry. This is a bit of complex code, so here's what's going on: temp.model contains a matrix called err.rate, just like model did before, and we want to access the value in the last row and the first column, i.e., the out-of-bag error rate when all 1,000 trees have been made. Now we can print out the out-of-bag error rates for the different values of mtry. The third value, corresponding to mtry = 3 (which is the default in this case), has the lowest out-of-bag error rate, so the default value was optimal, but we wouldn't have known that unless we tried other values.

Lastly, we want to use the random forest to draw an MDS plot with the samples. This will show us how they are related to each other. If you don't know what an MDS plot is, don't freak out; just check out the StatQuest on it. We start by using the dist() function to make a distance matrix from 1 minus the proximity matrix. Then we run cmdscale() on the distance matrix (cmdscale stands for classical multi-dimensional scaling). Then we calculate the percentage of variation in the distance matrix that the x and y axes account for (again, see the other StatQuest for details). Then we format the data for ggplot, and then we draw the graph with ggplot. Triple bam! Unhealthy samples are on the left side, and healthy samples are on the right side. I wonder if patient 253 was misdiagnosed and actually has heart disease. Note that the x-axis accounts for 47% of the variation in the distance matrix, while the y-axis only accounts for 14% of the variation in the distance matrix; that means the big differences are along the x-axis. Lastly, if we got a new patient, didn't know whether they had heart disease, and they clustered down here, we'd be pretty confident that they had heart disease.
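Here is a hedged sketch of these remaining steps: reshaping model$err.rate for ggplot2, scanning values of mtry, and drawing the MDS plot from the proximity matrix. Variable names like oob.error.data, temp.model, and mds.stuff are just illustrative, and the 1,000-tree comparison forest is folded into the mtry loop here; see the downloadable code for the exact version.

```r
## Reshape model$err.rate (one row per tree; columns OOB, Healthy, Unhealthy)
## into a long data frame that ggplot2 is happy with.
oob.error.data <- data.frame(
  Trees = rep(1:nrow(model$err.rate), times = 3),
  Type  = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model$err.rate)),
  Error = c(model$err.rate[, "OOB"],
            model$err.rate[, "Healthy"],
            model$err.rate[, "Unhealthy"]))

ggplot(data = oob.error.data, aes(x = Trees, y = Error)) +
  geom_line(aes(color = Type))   # error rates vs. number of trees

## Try a different number of variables at each internal node (mtry = 1..10),
## storing the OOB error rate after all 1,000 trees are built for each setting.
oob.values <- vector(length = 10)
for (i in 1:10) {
  temp.model <- randomForest(hd ~ ., data = data.imputed, mtry = i, ntree = 1000)
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 1]
}
oob.values   # the lowest value tells us the best mtry (here, the default of 3)

## MDS plot: turn (1 - proximity) into a distance matrix, run classical
## multi-dimensional scaling, and plot the first two axes.
distance.matrix <- dist(1 - model$proximity)
mds.stuff <- cmdscale(distance.matrix, eig = TRUE, x.ret = TRUE)
mds.var.per <- round(mds.stuff$eig / sum(mds.stuff$eig) * 100, 1)  # % variation per axis

mds.data <- data.frame(Sample = rownames(mds.stuff$points),
                       X = mds.stuff$points[, 1],
                       Y = mds.stuff$points[, 2],
                       Status = data.imputed$hd)

ggplot(data = mds.data, aes(x = X, y = Y, label = Sample)) +
  geom_text(aes(color = Status)) +
  xlab(paste("MDS1 - ", mds.var.per[1], "%", sep = "")) +
  ylab(paste("MDS2 - ", mds.var.per[2], "%", sep = "")) +
  ggtitle("MDS plot using (1 - Random Forest Proximities)")
```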
[MUSIC] Hooray! We've made it to the end of another exciting StatQuest. If you liked this StatQuest and want to see more of them, please subscribe. And if you have any suggestions for future StatQuests, put them in the comments below. Until next time, Quest on!