Transcript:

StatQuest is growing up, StatQuest is growing down, StatQuest is growing. Hello! I’m Josh Starmer. Welcome to StatQuest. Today, we’ll talk about decision trees.
Here is a simple decision tree. If a person loves StatQuest theme songs, then this person is amazing, and if a person doesn’t love StatQuest theme songs, then this person is a little less than amazing. In general, a decision tree asks a question and then classifies the person based on the answer. It’s no big deal!
This decision tree is based on a yes or no question, but it is just as easy to build a tree from numeric data. If a person has a very high heart rate, then that person had better see a doctor, and if a person doesn’t have a very high heart rate, then that person is fine.
Here’s another simple decision tree. This decision tree is based on ranked data, where 1 is super hungry and 2 is moderately hungry. If a person is super hungry, they need to eat; if a person is moderately hungry, they just need a snack; and if they are not hungry at all, they don’t need to eat. Note: the classification can be categories or numeric. In this case, we are using the weight of the mice to predict the size of the mice.
Here is a more complicated decision tree. It combines numeric data with yes or no data. Note that the heart rate cutoff doesn’t have to be the same everywhere. In this case, it’s 100 bpm on the left side and 120 bpm on the right side. Also, the order of the questions on the left side (first about heart rate and then about eating donuts) doesn’t have to be the same on the right side. On the right side, the donut question comes first. Finally, the final classifications can be repeated.
In general, decision trees are very intuitive to work with. You start at the top and work your way down until you reach a point where you can’t go any further, and that is how you classify a sample.
Oh no! Jargon alert!!
The top of the tree is called the “root node” or just “the root”. These are called “internal nodes” or just “nodes”. Internal nodes have arrows pointing to them and arrows pointing away from them. Finally, these are called “leaf nodes” or just “leaves”. Leaf nodes have arrows pointing to them, but no arrows pointing away from them.
Now we’re ready to talk about how to go from a table of data to a decision tree. In this example, we want to create a tree that uses “chest pain”, “good blood circulation” and “blocked arteries” to predict whether or not a patient has heart disease. The first thing we want to know is whether “chest pain”, “good blood circulation” or “blocked arteries” should be at the top of our tree.
We will start by looking at how well “chest pain” alone predicts heart disease. Here is a little tree that only considers “chest pain”. The first patient has no chest pain and no heart disease, and we store that information here. The second patient has chest pain and heart disease, and we store that information here. The third patient has chest pain but no heart disease. The fourth patient has chest pain and heart disease. In the end, we look at “chest pain” and “heart disease” for all 303 patients in this study.
Now we will do exactly the same thing for “good blood circulation”. Finally, we will look at how well “blocked arteries” separates patients with and without heart disease. Since we don’t know whether this patient has blocked arteries or not, we’ll skip it. However, there are alternatives that I will discuss in a future video.
Remember, the goal is to decide whether “chest pain”, “good blood circulation” or “blocked arteries” should be the first thing in our decision tree, also called the “root node”. So, we looked at how well “chest pain” separates patients with and without heart disease. It was good, but it was not perfect.
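The tallying described above, where each patient is dropped into a leaf and counted, can be sketched roughly as follows. This is only an illustration: the four rows are the first four patients mentioned in the video, and the helper name `tally` is made up for this sketch.

```python
from collections import Counter

# The first four patients described above, as (has_chest_pain, has_heart_disease).
patients = [
    (False, False),  # patient 1: no chest pain, no heart disease
    (True, True),    # patient 2: chest pain and heart disease
    (True, False),   # patient 3: chest pain, no heart disease
    (True, True),    # patient 4: chest pain and heart disease
]


def tally(rows):
    """Count heart disease yes/no outcomes in each leaf of a one-question tree.

    `tally` is a hypothetical helper name, not something from the video.
    """
    leaves = {True: Counter(), False: Counter()}
    for answer, has_disease in rows:
        leaves[answer]["yes" if has_disease else "no"] += 1
    return leaves


leaves = tally(patients)
print("chest pain leaf:", dict(leaves[True]))
print("no chest pain leaf:", dict(leaves[False]))
```

Running the same tally over all 303 patients would give the leaf counts used in the impurity calculations.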
Most patients with heart disease ended up in one leaf node, and most patients without heart disease ended up in the other. Then we looked at how well “good blood circulation” separates patients with and without heart disease. It was also not perfect. Finally, we checked how well “blocked arteries” separates patients with and without heart disease. Note: the total number of patients with heart disease is different for “chest pain”, “good blood circulation” and “blocked arteries” because some patients have a measurement for “chest pain” but not for “blocked arteries”, etc.
Oh no! It’s another one of those dreaded jargon alerts!! Because none of the leaf nodes is 100% “yes heart disease” or 100% “no heart disease”, they are all considered “impure”. To determine which separation is best, we need a way to measure and compare impurity. There are several ways to measure impurity, but I’m going to focus on a very popular one called “Gini”. To be honest, I don’t know why it is called “Gini”. I looked on the internet and couldn’t find anything. However, if you know, please put it in the comments below. I would love to know.
Anyway, the good news is that calculating Gini impurity is easy! Let’s start by calculating the Gini impurity for “chest pain”. For this leaf, the Gini impurity equals 1 minus the probability of “yes” squared minus the probability of “no” squared. Now let’s plug in the numbers. The probability of “yes” is 105 divided by the total number of people in that leaf node, and the probability of “no” is 39 divided by the total number of people in that leaf node. After we do the math, we get 0.395. That is, the Gini impurity for the left leaf node is 0.395.
Now let’s calculate the Gini impurity for the leaf node on the right. Just like before, it equals 1 minus the probability of “yes” squared minus the probability of “no” squared.
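The formula just stated can be written as a small function. This is a minimal sketch; the name `gini_leaf` is chosen here for illustration, and the counts are the left leaf's 105 “yes” and 39 “no” from the video.

```python
def gini_leaf(n_yes, n_no):
    """Gini impurity of one leaf: 1 - P(yes)^2 - P(no)^2."""
    total = n_yes + n_no
    p_yes = n_yes / total
    p_no = n_no / total
    return 1 - p_yes**2 - p_no**2


# Left leaf for "chest pain": 105 with heart disease, 39 without.
left_leaf = gini_leaf(105, 39)
print(round(left_leaf, 3))  # 0.395
```

The same function applies to any leaf; only the two counts change.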
The probability of “yes” is 34 divided by the total number of people in that leaf node, and the probability of “no” is 125 divided by the total number of people in that leaf node. If we do the math, we get 0.336.
Now that we’ve measured the Gini impurity for both leaf nodes, we can calculate the total Gini impurity for using “chest pain” to separate patients with and without heart disease. Because this leaf node represents 144 patients and this leaf node represents 159 patients, the leaf nodes do not represent the same number of patients. Thus, the total Gini impurity for using “chest pain” to separate patients with and without heart disease is the weighted average of the leaf node impurities. To calculate the weighted average, we take the number of people in the left leaf node and divide it by the total number of people in both leaf nodes, then multiply that fraction by the Gini impurity of the left leaf node. Then we take the number of people in the right leaf node, divide it by the total number of people in both leaf nodes, and multiply that fraction by the Gini impurity of the right leaf node. After doing the math, we get 0.364. Thus, the total Gini impurity for “chest pain” is 0.364.
And since I’m a nice guy, I’ll cut to the chase and tell you that the Gini impurity for “good blood circulation” is 0.360 and the Gini impurity for “blocked arteries” is 0.381. “Good blood circulation” has the lowest impurity, so it separates patients with and without heart disease the best. So, we’ll use it at the root of the tree.
Note: when we split all of the patients using “good blood circulation”, we are still left with “impure” leaf nodes. That is, each leaf contains a mixture of patients with and without heart disease.
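The weighted average and the choice of root can be reproduced with a short sketch. The function names `gini_leaf` and `gini_split` are made up for this illustration; the leaf counts for “chest pain” and the two quoted totals (0.360 and 0.381) are the ones given in the video.

```python
def gini_leaf(n_yes, n_no):
    """Gini impurity of one leaf: 1 - P(yes)^2 - P(no)^2."""
    total = n_yes + n_no
    return 1 - (n_yes / total) ** 2 - (n_no / total) ** 2


def gini_split(leaves):
    """Weighted average of leaf impurities; `leaves` is a list of (n_yes, n_no)."""
    total = sum(n_yes + n_no for n_yes, n_no in leaves)
    return sum(
        (n_yes + n_no) / total * gini_leaf(n_yes, n_no) for n_yes, n_no in leaves
    )


# "Chest pain": left leaf 105 yes / 39 no, right leaf 34 yes / 125 no.
chest_pain = gini_split([(105, 39), (34, 125)])
print(round(chest_pain, 3))  # 0.364

# The other two totals are quoted in the video rather than recomputed here.
impurity = {
    "chest pain": chest_pain,
    "good blood circulation": 0.360,
    "blocked arteries": 0.381,
}
root = min(impurity, key=impurity.get)
print(root)  # good blood circulation
```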
This means that the 164 patients with and without heart disease that ended up in this leaf node are now in this node of the tree, and the 133 patients with and without heart disease that ended up in this leaf node are now in this node of the tree.
Now we need to find out how well “chest pain” and “blocked arteries” separate these 164 patients (37 with heart disease and 127 without heart disease). Just like before, we separate these patients based on “chest pain” and then calculate the Gini impurity value. In this case, it is 0.3. Then we do exactly the same thing for “blocked arteries”. Since “blocked arteries” has the lowest Gini impurity, we will use it at this node to separate patients.
Here’s the tree we’ve built so far. We start at the top by separating patients with “good blood circulation”. Then we use “blocked arteries” to separate patients on the left side of the tree. All we have left is “chest pain”. So, first let’s see how well it separates these 49 patients (24 with heart disease and 25 without heart disease). Nice! “Chest pain” does a good job of separating the patients. So these are the leaf nodes on this branch of the tree.
Now let’s see what happens when we use “chest pain” to divide these 115 patients (13 with heart disease and 102 without). Note: the vast majority of patients in this node, 89%, do not have heart disease. Here’s how “chest pain” separates these patients. Do these two new leaves separate patients better than before? Well, let’s calculate the Gini impurity to see. In this case, it is 0.29. The Gini impurity for this node before using “chest pain” to separate patients is 0.2. The impurity is lower if we do not separate patients using “chest pain”, so let’s make it a leaf node.
OK, at this point we have worked out the entire left side of the tree. Now we need to work out the right side of the tree.
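The decision to stop splitting the 115-patient node can be checked with a few lines. This is a sketch under the video's numbers: the node holds 13 patients with heart disease and 102 without, and 0.29 is the quoted impurity after splitting on “chest pain” (the leaf counts behind it are not given in the transcript). The names `gini_leaf` and `should_split` are illustrative.

```python
def gini_leaf(n_yes, n_no):
    """Gini impurity of one node: 1 - P(yes)^2 - P(no)^2."""
    total = n_yes + n_no
    return 1 - (n_yes / total) ** 2 - (n_no / total) ** 2


def should_split(node_impurity, best_split_impurity):
    """Split only if the best split is purer than the node left as it is."""
    return best_split_impurity < node_impurity


node_impurity = gini_leaf(13, 102)
split_impurity = 0.29  # quoted in the video for splitting on "chest pain"

print(round(node_impurity, 1))                      # 0.2
print(should_split(node_impurity, split_impurity))  # False, so it stays a leaf
```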
The good news is that we follow exactly the same steps that we followed on the left side. First, calculate all of the Gini impurity values. Second, if the node itself has the lowest value, then there is no point in separating the patients any further, and it becomes a leaf node. Third, if separating the data results in an improvement, then pick the separation with the lowest impurity value. Hooray! We made a decision tree!
So far, we’ve seen how to build a tree with “yes or no” questions at each step. But what if we had numeric data, like patient weight? Imagine this were our data. How do we determine the best weight to divide the patients? Step 1: sort the patients by weight, from lowest to highest. Step 2: calculate the average weight for all adjacent patients. Step 3: calculate the impurity value for each average weight. For example, we can calculate the impurity value for weights below 167.5. In the end, we get 0.3 as the impurity value for that weight. Then we calculate the impurity values for the other weights too. The lowest impurity occurs when we separate at a weight of 205. So, this is the cutoff and impurity value that we will use when we compare weight with “chest pain” or “blocked arteries”.
Now we’ve seen how to build a tree with “yes or no” questions at each step and with numeric data, such as patient weight. Next, let’s talk about ranked data, like “rate my jokes on a scale of 1 to 4”, and multiple-choice data, like “which color do you like: red, blue or green?” Ranked data is similar to numeric data, except that now we calculate impurity values for all of the possible ranks. So, if people could rate my jokes from 1 to 4, with 4 being the funniest, we could calculate the following impurity values: joke rank less than or equal to 1, joke rank less than or equal to 2, and joke rank less than or equal to 3.
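Going back to the numeric case, the three steps (sort, average adjacent weights, score each average) can be sketched as below. The transcript does not list the actual weights or labels, so the five patients here are assumed values, chosen only so that the cutoffs 167.5 and 205 and the impurity 0.3 mentioned above come out the same. The function names are likewise illustrative.

```python
def gini_labels(labels):
    """Gini impurity of a list of True/False labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p_yes = sum(labels) / len(labels)
    return 1 - p_yes**2 - (1 - p_yes) ** 2


def score_cutoffs(weights, labels):
    """Return {cutoff: weighted Gini impurity} for every adjacent-pair average."""
    rows = sorted(zip(weights, labels))            # step 1: sort by weight
    scores = {}
    for (w1, _), (w2, _) in zip(rows, rows[1:]):
        cutoff = (w1 + w2) / 2                     # step 2: average of neighbours
        below = [lab for w, lab in rows if w < cutoff]
        above = [lab for w, lab in rows if w >= cutoff]
        scores[cutoff] = (                         # step 3: weighted impurity
            len(below) / len(rows) * gini_labels(below)
            + len(above) / len(rows) * gini_labels(above)
        )
    return scores


# ASSUMED illustration data, not taken from the transcript.
weights = [220, 180, 225, 190, 155]
has_heart_disease = [True, True, True, False, False]

scores = score_cutoffs(weights, has_heart_disease)
best_cutoff = min(scores, key=scores.get)
print(round(scores[167.5], 2))  # 0.3
print(best_cutoff)              # 205.0
```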
Note: we don’t need to calculate the impurity value for joke rank less than or equal to 4, because that would include everyone. When there are multiple choices, like “the color can be blue, green or red”, you calculate an impurity value for each choice as well as for each possible combination. For this example, with three colors (blue, green and red), we get the following options: color choice blue; color choice green; color choice red; color choice blue or green; color choice blue or red; and finally, color choice green or red. Note: we do not need to calculate the impurity value for color choice blue, green or red, since that includes everyone.
Bam! Now we know how to make and use decision trees. Follow the channel to learn about random forests. That’s when the fun really begins!
Hooray! We have reached the end of yet another StatQuest. If you liked this StatQuest and want to see more, please subscribe, and if you have suggestions for future StatQuests, post them in the comments below. Until next time!
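As a recap of the ranked and multiple-choice cases, the candidate questions described above can be enumerated with a short sketch. The helper names `ranked_candidates` and `choice_candidates` are made up for this illustration.

```python
from itertools import combinations


def ranked_candidates(ranks):
    """Thresholds for 'rank <= r', skipping the top rank (it would include everyone)."""
    ordered = sorted(set(ranks))
    return ordered[:-1]


def choice_candidates(options):
    """Every combination of options except the full set (it would include everyone)."""
    options = sorted(options)
    return [
        set(combo)
        for size in range(1, len(options))
        for combo in combinations(options, size)
    ]


print(ranked_candidates([1, 2, 3, 4]))  # [1, 2, 3]

for candidate in choice_candidates(["blue", "green", "red"]):
    print(sorted(candidate))
```

With three colors this lists the same six options as the video. Some of them describe the same two-way split from opposite sides (for example, “blue” versus “green or red”), so an implementation could skip those duplicates.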