Transcript:

What’s up, everybody? Welcome back to my YouTube channel, Richard on Data. If this is your first time here, my name is Richard, and this is the channel where we talk about all things data, data science, statistics, and programming. Subscribe for all kinds of content just like this, and hit the notification bell so YouTube notifies you whenever I upload a video.
Today, we’re going to talk about random forests. So this is the Amazon rainforest. It’s one of the most well-known forests in the world. It stretches across a few different countries in South America, but most of it’s in Brazil. This is the Crooked Forest. It’s in Poland, and I must say I have never seen pine trees that look like that, like upside-down question mark kind of things. That’s amazing. I didn’t even know that was possible. Then this is the Goblin Forest. This one’s located in New Zealand, and legend has it this one is full of orcs who, right now as we speak, are plotting their way into Middle-earth so they can commit a horrible series of devious village raids. I’m sorry, that was just too easy.
So, in all seriousness, random forests are probably my personal favorite machine learning algorithm of all time, and they are one of the most popular ones overall. But they’re not magical. They are not some tool that you can just throw at any possible problem and then, bam, profit. They do have a reputation for not quite being the black box sort of thing that something like a neural network is, but being significantly less interpretable than something like a regression model or even a decision tree. More on that shortly, because random forests can still be one of the most interpretable of the machine learning methods, at least in the context of explaining the work that you’ve done to non-technical audiences.
Now, as a disclaimer, I’m going to give a brief overview of what exactly random forests are, but this is not a video to provide all of the detailed methodology behind them.
That’s probably the subject for a different video. For now, I’ll have a link in the description to a video by StatQuest with Josh Starmer. I think his video does a really nice job of breaking random forests down. I’ll also provide a link in the description to some documentation by the guys who wrote the modern implementation of the random forest algorithm. The focus of this video is on the practical instances in which random forests can help you out, as well as when they may miss the mark a little bit. We’re going to talk about this within the context of two separate goals: inference and prediction. Now, if you’re unfamiliar with the distinction between these two terms, I did a video on this topic. I’ll have a card for it up above, as well as a link in the description to that too.
So, for those of you who are totally unfamiliar with random forests, here’s a short overview. The key building block of a random forest classifier is the decision tree. Decision trees are a classic method for classification or regression problems, and in particular they enjoy a little bit of popularity just because they’re super easy to interpret. Here’s an example of a decision tree on the right, and the decision we’re trying to make here is whether to play tennis or not. You’ve got various decisions that you make at each node, and overall you end up with something that’s totally clear and very easy to understand, even to a non-technical audience. For example, let’s start at the top. We ask what the weather outlook is. Let’s suppose it’s sunny. You go down the tree and ask what the humidity is. If it’s high, we say no, we’re not going to play tennis. If it’s normal, we say yes, we are going to play tennis, and so on and so forth.
The problem with decision trees, though, is that they tend to have high variance. That is, they tend to overfit to the training set, and then they don’t generalize very well. Random forests attempt to correct that problem by ensembling over many different trees.
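To make the tennis tree from the overview concrete, here is a minimal Python sketch of it as a nested dictionary plus a function that walks from the root to a leaf. Only the sunny branch (humidity high means no, normal means yes) comes from the video. The overcast and rain branches, and the names `TENNIS_TREE` and `predict`, are assumptions filled in from the usual textbook version of this example.

```python
# A decision tree as nested dicts: inner nodes ask about one variable,
# leaves are plain strings holding the decision.
TENNIS_TREE = {
    "ask": "outlook",
    "branches": {
        # This branch is the one walked through in the video.
        "sunny": {"ask": "humidity", "branches": {"high": "no", "normal": "yes"}},
        # Assumption: these two branches follow the common textbook tree.
        "overcast": "yes",
        "rain": {"ask": "wind", "branches": {"strong": "no", "weak": "yes"}},
    },
}


def predict(tree, example):
    """Follow the answers in `example` down the tree until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        answer = example[node["ask"]]
        node = node["branches"][answer]
    return node


sunny_humid = predict(TENNIS_TREE, {"outlook": "sunny", "humidity": "high"})
sunny_normal = predict(TENNIS_TREE, {"outlook": "sunny", "humidity": "normal"})
```

Reading the dictionary top to bottom is the same as reading the picture of the tree, which is why a single tree is so easy to explain.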
You create a bootstrapped dataset, retaining only a subset of all the available variables. Then you fit a decision tree based on that dataset, and that tree, like any other, provides the rules for predicting the outcome. Then you repeat that process many times, and you tally up the votes for what the individual trees are predicting your outcome to be. The outcome with the most votes at the end is the random forest’s prediction. So, at the end of the day, you’ve got a robust and very nicely performing classifier that doesn’t suffer from the individual decision tree’s overfitting problem.
Now, one awesome feature of random forests is that they also provide you a ranking of the importance of all your variables. Then you can generate a variable importance plot, which looks like this. This is basically telling us RM is the most important feature, followed by LSTAT, followed by DIS, and so on and so forth. Now, let’s just pretend for a second that you’re working on a project and your goal is inference; that is, you want to understand the variables that have the highest impact on the response. Well, guess what? The variable importance plot from a random forest will help you in that goal. The variable importance plot we just generated based on this data tells us that, by a fairly wide margin, RM and LSTAT are the two most important features in this data.
Now, you might be saying at this point, “But random forests aren’t interpretable the same way that a regression model is.” You would be correct. However, here’s what’s important to understand. All models, whether they’re statistical or machine learning, carry with them certain assumptions, and they’re going to have certain advantages and disadvantages. Here’s just an example. Regression models are only equipped to detect linear relationships. Decision trees and random forests, on the other hand, can detect non-linear relationships. Now, let’s say you ran a regression model, and you found that RM was the most important variable.
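The bootstrap, fit, and vote procedure described earlier can be sketched in plain Python. This is a toy illustration under stated assumptions, not the reference implementation: each "tree" is a one-split stump, the variable subset is drawn once per tree as in the description above (standard random forests grow deeper trees and re-draw the candidate variables at every split), and the function names and the tiny dataset are made up for the example.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Fit a one-split tree: the (feature, threshold) with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not right:  # threshold at the maximum splits nothing off
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return lambda row: majority
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random variable subset."""
    rng = random.Random(seed)
    n = len(rows)
    all_features = list(range(len(rows[0])))
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # sample rows with replacement
        feats = rng.sample(all_features, n_features)  # keep only some variables
        trees.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return trees


def predict_forest(trees, row):
    """Tally one vote per tree and return the outcome with the most votes."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Made-up data: two well-separated groups described by two variables.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = fit_forest(rows, labels)
```

Calling `predict_forest(forest, (1, 1))` tallies the 25 stumps' votes. Any single stump can be thrown off by an unlucky bootstrap sample, but the majority vote smooths that out, which is the variance reduction the ensemble is after.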
Then you ran a random forest. You created this variable importance plot, and it’s also telling you that RM is the most important variable. Well, from a practical standpoint, you have a lot of evidence now that that is a very robust result, because two very different models are effectively giving you back the same finding. But then let’s just say you ran that regression model, and it told you age was the single most important variable. Well, age is the eighth variable down the list in the variable importance plot from the random forest. So you might just have learned something that prevented you from making too strong of a conclusion about the impact age actually has on your response. So I’m gonna level with all of you and say random forest is actually a really helpful tool for inference, and not just for prediction.
Now, you may say it’s still kind of like a black box and most people don’t understand the math behind it, and you’d mostly be right. But I would tell you that, just from my own personal experience, most people do intuitively get the idea behind a decision tree, and then it’s fairly easy to explain after that that a random forest is essentially just an ensemble of various different decision trees using resamples of your data. Not everyone will understand the math the way that they might with regression models, but I’ll take explaining that logic any day over explaining neural networks to non-technical types.
So, now that I’ve hopefully convinced you that random forests can be pretty useful from the standpoint of inference, let’s come back and talk about the conditions when they work really well, as well as when they don’t work quite so well. For starters, one good point about random forests is they’re not the most sensitive to outliers.
Now, I think it’s unfair to just say outright that they’re robust to outliers, because that’s maybe a little too strong of a statement, but the intuition for this is in the base learner of the random forest, that is, the decision tree. In the fitting process for these, outliers typically get isolated into pretty small leaves of the tree, and decision trees operate locally. So what you end up with is an algorithm that isn’t totally robust necessarily, but because of that combination of two effects, the aggregation across multiple trees and the local modeling that occurs, the design of the algorithm makes it so that it’s not as dramatically impacted by outliers as some others are.
Another positive of random forests is that the algorithm is pretty stable, meaning if you add a small amount of new data, you’re very unlikely to get radically different results, either in the variable importance rankings or the predictions. This should make sense on an intuitive level, because while the new data will affect trees on an individual level, once you start ensembling over a number of trees, particularly over a number of trees which aren’t even going to be affected by the new data at all, you’re just not going to change the overall forest all that much.
Another good feature of random forests is they can deal with missing data, and there are two distinct methods they have for doing so. Option one is really straightforward. If you have missing continuous data, it fills it in with the median for that class. If you have missing categorical data, it fills it in with the most common level for that class. That’s it. Option two is a little better. Basically, it does a rough fill-in of the missing values. Then it runs the forest. Then it computes the proximity-weighted average in the case of continuous variables, or the proximity-weighted most frequent non-missing value for categorical variables. This process is repeated.
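Option one, the class-wise median or most-common-level fill, is simple enough to sketch in a few lines of Python. This is only an illustration of the idea as described above, not the code from any random forest package. The helper name `class_fill`, the use of `None` for a missing value, and the small tennis-style dataset are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from statistics import median


def class_fill(rows, labels, column, categorical=False):
    """Return copies of `rows` with None in `column` filled in per class.

    Continuous columns get the class median; categorical columns get the
    most common level in the class. Assumes every class has at least one
    observed value in the column (otherwise a KeyError is raised).
    """
    observed = defaultdict(list)
    for row, label in zip(rows, labels):
        if row[column] is not None:
            observed[label].append(row[column])

    fill = {}
    for label, values in observed.items():
        if categorical:
            fill[label] = Counter(values).most_common(1)[0][0]
        else:
            fill[label] = median(values)

    filled = []
    for row, label in zip(rows, labels):
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row[column] is None:
            new_row[column] = fill[label]
        filled.append(new_row)
    return filled


# Made-up data: one missing humidity reading and one missing outlook.
rows = [
    {"humidity": 85.0, "outlook": "sunny"},
    {"humidity": 90.0, "outlook": "sunny"},
    {"humidity": None, "outlook": "rain"},
    {"humidity": 70.0, "outlook": "overcast"},
    {"humidity": 65.0, "outlook": None},
    {"humidity": 75.0, "outlook": "overcast"},
]
labels = ["no", "no", "no", "yes", "yes", "yes"]

step1 = class_fill(rows, labels, "humidity")
step2 = class_fill(step1, labels, "outlook", categorical=True)
```

Here the missing humidity in the "no" class becomes 87.5, the median of 85 and 90, and the missing outlook in the "yes" class becomes "overcast". Option two would start from a rough fill like this one and then refine it using the forest's proximities.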
For that second option, usually only about four to six repetitions are needed. There’s more on this at a link in the description, from Leo Breiman and Adele Cutler, who are the inventors of the modern random forest algorithm. All of that to say, random forests do have solid ways of estimating missing data, and you’re probably going to get pretty good performance even if that’s a problem that you’re facing with your data.
Just as a side note and little-known fact: because the random forest can calculate proximities between observations, it can be used for unsupervised clustering. Again, this is detailed more by the creators of the algorithm, and that link is in the description. Also in the description, I’ll include a link to an abstract on this very application. It’s just another example of how it is not appropriate to think of random forest as just another black box classification method. That’s a common misconception that I think really sells short how beautiful and how powerful the algorithm is.
Now, obviously, random forests are not perfect, and they do come with some shortcomings. For one thing, random forests are more memory intensive, and they’re going to take longer to train and make predictions with compared to decision trees or even some other machine learning algorithms. Intuitively, that should make sense: random forests do involve many trees, so we would expect them to take longer than just one tree. This should not be much of a problem as far as your number of covariates is concerned, but it can start to creep up and become a problem if you have a dataset with a lot of observations.
Then also, the approach of using random forests for variable importance is not perfect either. In fact, there are a couple of things which can trip this up by causing bias. One of those is when you have a dataset that has a mix of continuous predictors as well as categorical predictors, particularly when those categorical predictors have a fairly small number of levels.
Well, in this instance, the random forest variable importance is going to be biased in favor of categorical variables with more levels. The other instance is when you have some predictors that are heavily correlated. This is going to cause you some problems. In particular, it’s probably going to make it so that some variables appear important when, in reality, they’re not. Luckily, there is a technique for mitigating this problem, and it’s through growing unbiased conditional inference trees. As usual, links to all that fun stuff are in the description, and there are ways to implement this approach in R: you can use the cforest function from the party package. The downside, though, is, let’s just say this approach is not the fastest car on the race track, if you get what I mean.
So hopefully by now you get the idea with random forests. They’re a beautiful algorithm, and I’ve personally had a great experience with them, no matter what I was trying to do. Whether it was confirming a pre-existing hypothesis about important variables or classifying a test set, I just had a great time with them. But like anything, they’re not perfect, and hopefully, now that you’ve watched this video, you have an idea of when they’re well suited for the problem and when they’re not.
So thanks for watching this video. If you enjoyed it and you’d like to support my work, the most helpful thing that you could do for me would be to share this video. Otherwise, at least consider smashing the like button, and also let me know down below when you’ve used random forests and how they’ve worked for you. Then I’ll see you all in the not-so-distant future. Until then, Richard on Data.