Transcript:

[MUSIC] Gradiant Buspar Tuan regression main ideas, stat quest. Hello, I’m Josh Stormer. And welcome to stat quest. Today we’re going to talk about the gradient boost machine. Learning algorithm specifically we’re going to focus on how gradient boost is used for regression note. This stat quest assumes you already understand decision trees. So if you’re not already down with those check out the quest, this stat quest also assumes that you are familiar with adaboost and the trade-off between bias and variance. If not, check out the quests, the links are in the description below this stat quest is the first part in a series that explains how the gradient boost machine learning algorithm works specifically. We’ll use this data where we have the height measurements from six people, their favorite colors, their genders and their weights and we’ll walk through step by step the most common way that gradient boost fits a model to this training data note when gradient boost is used to predict a continuous value like weight. We say that we are using gradient boost for regression using gradient boosts for regression is different from doing a linear regression so while the two methods are related. Don’t get them confused with each other part two in this series, we’ll dive deep into the math behind the gradient boost algorithm for regression, walking through it step-by-step and proving that what we cover today is correct. Part Three in this series shows how gradient boost can be used for classification, specifically, we’ll walk through step by step. The most common way gradient boost can classify someone is either loving the movie troll 2 or not loving troll 2 part 4 well return to the math behind gradient boost this time focusing on classification walking through it. Step-by-step note, the gradient boost algorithm looks complicated because it was designed to be configured in a wide variety of ways, but the reality is that 99% of the time only one configuration is used to predict continuous values like weight and one configuration is used to classify samples into different categories. This stack quest focuses on showing you the most common way gradient boost is used to predict a continuous value like weight. If you are familiar with adaboost, then a lot of gradient boost will seem very similar, so let’s briefly compare and contrast at a boost and gradient boost if we want to use these measurements to predict weight. Then adaboost starts by building A very short tree called a stump from the training data, and then the amount of say that the new stump has on the final output is based on how well it compensated for those previous errors. Then Adaboost builds the next stomp, based on errors that the previous stump made in this example. The new stump did a poor job compensating for the previous stumps errors and its size reflects it’s reduced amount of say, Then Adaboost builds another stump based on the errors made by the previous stump. And this stump did a little better than the last stop. So it’s a little larger. Then Adaboost continues to make stumps in this fashion until it is made the number of stumps You asked for or it has a perfect fit. In contrast, gradient boost starts by making a single leaf instead of a tree or stump. This leaf represents an initial guess for the weights for all of the samples when trying to predict a continuous value like weight. The first guess is the average value, then gradient boost builds a tree like adaboost. This tree is based on the errors made by the previous tree. But unlike adaboost, this tree is usually larger than a stump that said gradient boost still restricts the size of the tree in the simple example that we will go through in this stat quest, we will build trees with up to 4 leaves, but no larger. However, in practice, people often set the maximum number of leaves to be between 8 and 32 Thus like adaboost gradient boost builds fixed size trees based on the previous tree’s errors, but unlike adaboost, each tree can be larger than a stump, also like adaboost gradient boost scales, The trees, however, gradient boost scales all trees by the same amount, then gradient boost builds another tree based on the errors made by the previous tree, and then it scales the tree and gradient boost continues to build trees in this fashion until it has made the number of trees you asked for or additional trees fail to improve the fit. Now that we know the main similarities and differences between gradient boost and adaboost, let’s see how the most common gradient boost configuration would use this training data to predict weight. The first thing we do is calculate the average weight. This is the first attempt at predicting everyone’s weight in other words. If we stopped right now, we would predict that everyone weighed 70 1.2 kilograms, however, gradient boost doesn’t. Stop here the next thing we do is build a tree based on the errors from the first tree. The errors that the previous tree made are the differences between the observed weights and the predicted weight 70 1.2 So let’s start by plugging in seventy one point, two for the predicted weight and then plug in the first observed. Wait and do the math and save the difference, which is called a pseudo residual in a new column note. The term pseudo residual is based on linear regression, where the difference between the observed values and the predicted values results in residuals. The pseudo part of pseudo residual is a reminder that we are doing gradient boost. Not linear regression and is something. I’ll talk more about in part 2 of this series when we go through the math. Now we do the same thing For the remaining weights now. We will build a tree using height, favorite color and gender to predict the residuals. If it seems strange to predict the residuals instead of the original weights, just bear with me and soon all will become clear, so setting aside the reason why we are building a tree to predict the residuals for the time being. Here’s the tree, Remember? In this example, we are only allowing up to four leaves, but when using a larger data set, it is common to allow anywhere from 8 to 32 by restricting the total number of leaves we get fewer leaves than residuals as a result. These two rows of data go to the same leaf, so we replace these residuals with their average and these two rows of data go to the same leaf, so we replace these residuals with their average now we can combine the original leaf with the new tree to make a new prediction of an individual’s weight from the training data we start with the initial prediction. Seventy one point two. Then we run the data down the tree, and we get sixteen point eight, so the predicted weight equals Seventy one point, two plus sixteen point eight, which equals eighty eight, which is the same as the observed weight. Is this awesome? No, the model fits the training data too well. In other words, we have low bias, but probably very high variance gradient boost deals with this problem by using a learning rate to scale the contribution from the new tree. The learning rate is a value between zero and one in this case. We’ll set the learning rate to 0.1 now. The predicted weight equals Seventy one point two plus zero point one time’s sixteen point eight, which equals seventy two point nine with the learning rate set to 0.1 The new prediction isn’t as good as it was before, but it’s a little better than the prediction made with just the original leaf, which predicted that all samples would weigh Seventy one point two, in other words, scaling the tree by the learning rate results in a small step in the right direction, according to the dude that invented gradient boost. Jerome Freedman. Empirical evidence shows that taking lots of small steps in the Right Direction results in better predictions with a testing data set, ie lower variance. Bam, so let’s build another tree so we can take another small step in the right direction, just like before. We calculate the pseudo residuals the difference between the observed weights and our latest predictions, so we plug in the observed weight and the new predicted weight and we get 15.1 and we save that in the column for pseudo residuals. Then we repeat for all of the other individuals in the training data set. Small BAM note. These are the original residuals from when our prediction was simply the average overall weight and these are the residuals after adding the Nutri scaled by the learning rate. The new residuals are all smaller than before, so we’ve taken a small step in the right direction. Double bam! Now let’s build a new tree to predict the new residuals. And here’s the Nutri note. In this simple example, the branches are the same as before, however, in practice, the trees can be different each time just like before. Since multiple samples ended up in these leaves, we just replace the residuals with their averages. Now we combine the new tree with the previous tree and the initial leaf note we scale all of the trees by the learning rate, which we set to 0.1 and add everything together. Now we’re ready to make a new prediction from the training data just like before we start with the initial prediction, then add the scaled amount from the first tree in the scaled amount from the second tree that gives us seventy one point two plus zero point one times sixteen point eight plus zero point one times fifteen point one, which equals seventy four point four, which is another small step closer to the observed weight. Now we use the initial leaf, Plus the scaled values from the first tree, plus the scaled values from the second tree to calculate new residuals, remember? These were the residuals from when we just use a single leaf to predict weight and these were the residuals after we added the first tree to the prediction and these are the residuals. After we added the second tree to the prediction each time we add a tree to the. Bur diction. The residuals get smaller, so we’ve taken another small step towards making good predictions. Now we build another tree to predict the new residuals and add it to the chain of trees that we have already created and we keep making trees until we reach the maximum specified or adding additional trees does not significantly reduce the size of the residuals. Bam, then when we get some new measurements, we can predict weight by starting with the initial prediction, then adding the scaled value from the first tree and the second tree and the third tree, etc, etc, etc. Once the math is all done. We are left with the predicted weight in this case. We predicted that this person weighed 70 kilograms triple bomb in summary when gradient boost is used for regression. We start with a leaf. That is the average value of the variable. We want to predict in this case. We want it to predict wait. Then we add a tree based on the residuals. The difference between the observed values and the predicted values and we scale the tree’s contribution to the final prediction with a learning rate. Then we add another tree Based on the new residuals, adding trees based on the errors made by the previous tree. That’s all there is to it. Bam tune in for part 2 in this series when we dive deep into the math behind the gradient boost algorithm for regression, walking through its step by step and proving that it really is this simple. Hooray, we’ve made it to the end of another exciting stat quest. If you like this stat quest and want to see more, please subscribe. And if you want to support stat quest, consider buying one of my original songs or buying a stat quest t-shirt or hoodie. The links are in the description below. Alright, until next time quest on.