Transcript:

Regression tree is for you and for me, stat quest, hello. I’m Josh Starman. Welcome to stat quest today. We’re going to talk about regression trees, and they’re gonna be clearly explained. This stat quest assumes you are already familiar with the trade-off that plagues all of machine, learning the bias-variance tradeoff and the basic ideas behind decision, trees and the basic ideas behind regression. If not, check out the quests, the links are in the description below. Now imagine we developed a new drug to cure the common cold. However, we don’t know the optimal dosage to give patients, so we do a clinical trial with different dosages and measure how effective each dosage is, the data looked like this and in general, the higher the dose, the more effective the drug, then we could easily fit a line to the data, and if someone told us they were taking a 27 milligram dose, we could use the line to predict that a 27 milligram dose should be 62% effective, however, what if the data looked like this low dosages are not effective, moderate dosages work really well somewhat higher dosages work at about 50% effectiveness and high dosages are not effective at all in this case fitting a straight line to the data will not be very useful, For example, If someone told us they were taking a 20 milligram dose. Then we would predict that a 20 milligram dose should be 45% effective, even though the observed data says it should be 100% effective, so we need to use something other than a straight line to make predictions. One option is to use a regression tree. Regression trees are a type of decision tree in a regression tree. Each leaf represents a numeric value contrast, classification trees have true or false in their leaves or some other discrete category with this regression tree. We start by asking if the dosage is less than 14.5 if so, then we are talking about these six observations in the training data and the average drug effectiveness for these six observations is 4.2 percent, so the tree uses the average value for point two percent as its prediction for people with dosages less than fourteen point five on the other hand, if the dosage is greater than or equal to 14.5 and greater than or equal to 29 Then we are talking about these four observations in the training data set and the average drug effectiveness for these four observations is 2.5 percent, so the tree uses the average value 2.5 percent as its prediction for people with dosages greater than or equal to 29 Now, if the dosage is greater than or equal to 14.5 and less than 29 and greater than or equal to 23 point 5 Then we are talking about these 5 observations in the training data set and the average drug effectiveness for these 5 observations is 52.8% so the tree uses the average value 52.8% as its prediction for people with dosages between 23 point five and 29 Lastly, if the dosage is greater than or equal to 14.5 and less than 29 and less than 23 point 5 Then we are talking about these four observations in the training data set and the average drug effectiveness for these four observations is 100% so the tree uses the average value 100% as its prediction for people with dosages between 14.5 and 23.5 since each leaf corresponds to the average drug effectiveness in a different cluster of observations. The tree does a better job, reflecting the data than the straight line. At this point? You might be thinking. The regression tree is cool, but I can also predict drug effectiveness just by looking at the graph, For example, if someone said they were taking a 27 milligram dose, then just by looking at the graph. I can tell that the drug will be about 50% effective, so why make a big deal about the regression tree? When the data are super simple and we are only using one predictor dosage to predict drug effectiveness, making predictions by eye isn’t terrible, but when we have three or more predictors like dosage age and sex to predict drug effectiveness, drawing a graph is very difficult if not impossible. In contrast, a regression tree easily accommodates the additional predictors, for example. If we wanted to predict the drug effectiveness for this patient, we would start by asking if they are older than 50 and since they are not over 50 we follow the branch on the right and ask if their dosage is greater than or equal to 29 and since their dosage is not greater than or equal to 29 We follow the branch on the right and ask if they are female, and since they are female, we follow the branch on the left and predict that the dosage will be 100% effective and that’s not too far off from the truth 98% Okay, now that we know that regression trees can easily handle complicated data. Let’s go back to the original data with just one predictor dosage and talk about how to build this regression tree from scratch and since regression trees are built from the top down. The first thing we do is figure out why we start by asking. If dosage is less than 14.5 going back to the graph of the data, let’s focus on the two observations with the smallest dosages. Their average dosage is three, and that corresponds to this dotted red line. Now we can build a very simple tree that splits the observations into two groups, based on whether or not dosage is less than three. The point on the far left is the only one with dosage less than three and the average drug effectiveness for that one point is zero, so we put zero in the leaf on the left side for when dosage is less than three. All of the other points have dosages greater than or equal to three and the average drug effectiveness for all of the points with dosages greater than or equal to three is thirty eight point eight, so we put 38.8 in the leaf on the right side for when dosage is greater than or equal to three, the values in each leaf are the predictions that this simple tree will make for drug effectiveness, For example, this point on the far left has dosage less than three, and the tree predicts that the drug effectiveness will be zero the prediction for this point. Drug effectiveness equals zero is pretty good since it is the same as the observed value, in contrast for this point, which has dosage greater than three, the tree predicts that the drug effectiveness will be 38.8 and that prediction is not very good since the observed drug effectiveness is 100% note. We can visualize how bad the prediction is by, drawing a dotted line between the observed and predicted values in other words. The dotted line is a residual for each point in the data we can draw. Its residual. The difference between the observed and predicted values and we can use the residuals to quantify the quality of these predictions, starting with the only point with dosage less than three, we calculate the difference between its observed drug effectiveness zero and the predicted drug effectiveness zero and then square the difference in other words. This is the squared residual for the first point. Now we add the square residuals for the remaining points, with dosages greater than or equal to three. In other words. For this point, we calculate the difference between the observed and predicted values and square it and then add it to the first term. Then we do the same thing for the next point and the next point and the rest of the points do – II II II II II II II II II II II II II II II II II until we have added squared residuals for every point, thus to evaluate the predictions made when the threshold is dosage less than three, we add up the squared residuals for every point and get twenty seven thousand four hundred. Sixty eight point five note we can plot the sum of squared residuals on this graph, the Y axis corresponds to the sum of squared residuals and the X axis corresponds to dosage thresholds. In this case, the dosage threshold was three, but if we focus on the next two points in the graph and calculate their average dosage, which is five, then we can use dosage less than five as a new threshold and using dosage less than five gives US new predictions and new residuals, and that means we can add a new sum of squared residuals to our graph in this case, the new threshold dosage less than five results in a smaller sum of squared residuals, and that means using dosage less than five as the threshold resulted in better predictions overall. Bam, now let’s focus on the next two points calculate their average, which is seven and use dosage less than seven as a new threshold again. The new threshold gives US new predictions, new residuals and a new sum of squared residuals now shift the threshold over to the average dosage for the next two points and add a new sum of squared residuals to the graph, and we repeat until we have calculated the sum of squared residuals for all of the remaining thresholds. T 2 T 2 2 2 2 2 2 Bam. Now we can see the sum of squared residuals for all of the thresholds and dosage less than 14.5 has the smallest sum of squared residuals so dosage less than 14.5 will be the root of the tree. In summary, we split the data into two groups by finding the threshold that gave us the smallest sum of squared residuals. Bam, now! Let’s focus on the six observations with dosage less than 14.5 that ended up in the node to the left of the root. In theory, we could split these six observations into two smaller groups, just like we did before by calculating the sum of squared residuals for different thresholds and choosing the threshold with the lowest sum of squared residuals note, this observation has dosage less than 14.5 and does not have dosage less than 11.5 so it is the only observation to end up in this node, and since we can’t split a single observation into two groups, we will call this node a leaf. However, since the remaining five observations go to the other node, we can split them. Once more now, we have divided the observations with dosage less than 14.5 into three separate groups. These two leaves only contain one observation each and cannot be split into smaller groups. In contrast, this leaf contains four observations that said those four observations all have the same drug effectiveness, so we don’t need to split them into smaller groups, so we are done splitting the observations with dosage less than 14.5 into smaller groups note the predictions that this tree makes for all observations with dosage less than 14.5 are perfect, in other words. This observation has 20% drug effectiveness and the tree predicts 20% drug effectiveness, so the observed and predicted values are the same. This observation has 5% drug effectiveness, and that’s exactly what the tree predicts. These four observations all have 0% drug effectiveness, and that’s exactly what the tree predicts. Is that awesome? No, when a model fits the training data perfectly, it probably means it is over fit and will not perform well with new data in machine learning lingo, the model has no bias, but potentially large variants bummer. Is there a way to prevent our tree from overfitting the training data? Yes, there are a bunch of techniques. The simplest is to only split observations when there are more than some minimum number. Typically, the minimum number of observations to allow for a split is 20 however, since this example, doesn’t have many observations. I set the minimum to 7 in other words. Since there are only six observations with dosage less than 14.5 we will not split the observations in this node. Instead, this node will become a leaf and the output will be the average drug effectiveness for the six observations, with dosage less than 14.5 4.2% Bam, now we need to figure out what to do with the remaining 13 observations with dosages greater than or equal to 14.5 since we have more than 7 observations on the right side, we can split them into two groups, and we do that. By finding the threshold that gives us the smallest sum of squared residuals note, there are only four observations with dosage greater than or equal to 29 Thus, there are only four observations in this node. Thus we will make this a leaf because it contains fewer than seven observations and the output will be the average drug effectiveness for these four observations 2.5% Now we need to figure out what to do with the nine observations with dosages between 14.5 and 29 since we have more than seven observations, we can split them into two groups by finding the threshold that gives us the minimum sum of squared residuals note since there are fewer than seven observations in each of these two groups. This is the last split because none of the leaves have more than seven observations in them, so we use the average drug effectiveness for the observations with dosages between fourteen point five and twenty-three point five one hundred percent as the output for the leaf on the right, and we use the average drug effectiveness for observations with dosages between twenty three point five and twenty nine fifty two point eight percent as the output for the leaf on the Left. Since no leaf has more than seven observations in it we’re done building the tree, and each leaf corresponds to the average drug effectiveness from a different cluster of observations. Double BAM. So far, we have built a tree using a single predictor dosage to predict drug effectiveness. Now let’s talk about how to build a tree to predict drug effectiveness using a bunch of predictors just like before we will start by using dosage to predict drug effectiveness, thus just like before we will try different thresholds for dosage and calculate the sum of squared residuals at each step and pick the threshold that gives us the minimum sum of squared residuals. The best threshold becomes a candidate for the root. Now we focus on using age to predict drug effectiveness just like with dosage. We tried different thresholds for age and calculate the sum of squared residuals at each step and pick the one that gives us the minimum sum of squared residuals. The best threshold becomes another candidate for the route. Now we focus on using sex to predict drug effectiveness with sex. There is only one threshold to try, so we use that threshold to calculate the sum of squared residuals and that becomes another candidate for the route. Now we compare the sum of squared residuals. Ssrs for each candidate and pick the candidate with the lowest value since age, greater than 50 had the lowest sum of squared residuals. It becomes the root of the tree. Then we grow the tree just like before, except now we compare the lowest sum of squared residuals from each predictor and just like before when a leaf has less than a minimum number of observations, which is usually 20 but we are using 7 We stop trying to divide them Triple bow. In summary. Rushon trees are a type of decision tree in a regression tree. Each leaf represents a numeric value we determine how to divide the observations by trying different thresholds in calculating the sum of squared residuals at each step. PP boo boo beep, beep, boo boo peepee poo-poo. The threshold with the smallest sum of squared residuals, becomes a candidate for the root of the tree. If we have more than one predictor, we find the optimal threshold for each one and pick the candidate with the smallest sum of squared residuals to be the root when we have fewer than some minimum number of observations in a node 7 in this example, but more commonly 20 then that node becomes a leaf. Otherwise we repeat the process to split the remaining observations until we can no longer split the observations into smaller groups. And then we are done. Hooray, we’ve made it to the end of another exciting stat quest if you liked this stack quest and want to see more, please subscribe. And if you want to support Stack Quest, consider contributing to my patreon campaign, buying one or two of my original songs or a t-shirt or a hoodie or just donate the links are in the description below. Alright until next time quest on.