Transcript:

Hurricane Florence came just in time … for the StatQuest I was working on … Dark clouds covered the sky … but that didn't stop StatQuest … StatQuest!!! StatQuest!!! Hello, my name is Josh Starmer, and welcome to StatQuest. Today we are going to talk about some fundamentals of Machine Learning: Bias and Variance. And they will be explained in a simple way. Suppose we measure the weight and height of a group of mice and plot the data on a graph. Light mice tend to be short, and heavier mice tend to be taller, but past a certain weight, mice don't get any taller, they just get more obese. Given this data, we would like to predict a mouse's height from its weight. For example, if you told me that your mouse weighed this much, then we would predict that its height is this value. In an ideal situation, we would know the exact mathematical formula that describes the relationship between weight and height. But in this case we don't know it, so we will use two machine learning methods to approximate this relationship. Still, I will leave the "true" relationship drawn in the figure for reference. The first thing we are going to do is divide the data into two sets: one for training the machine learning algorithms and one for testing them. The blue dots are the training set and the green dots are the test set. Here we have only the training set. The first machine learning algorithm that we will use is Linear Regression (also known as Least Squares). Linear Regression fits a straight line to the training set. Note that a straight line doesn't have the flexibility needed to faithfully replicate the arc in the "true" relationship. No matter how we try to adjust the line, it will never curve. Therefore, the straight line will never represent the true relationship between weight and height well, no matter how well we fit it to the training set.
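As a concrete illustration of fitting a straight line with least squares, here is a minimal Python sketch. The data are made up for illustration only: the arcing "true" curve, the noise level, and all variable names are assumptions, not the values from the figure in the video.

```python
import numpy as np

# Hypothetical synthetic data standing in for the mouse weights and
# heights in the video (values are assumptions, not StatQuest's data).
rng = np.random.default_rng(0)
weight = np.linspace(1.0, 10.0, 12)                     # training weights
true_height = 10 * (1 - np.exp(-0.5 * weight))          # an arcing "true" relationship
height = true_height + rng.normal(0, 0.3, weight.size)  # noisy observations

# Linear Regression (least squares): fit a straight line to the training set.
slope, intercept = np.polyfit(weight, height, deg=1)

def predict(w):
    """Predict height from weight using the fitted straight line."""
    return slope * w + intercept

# A straight line can't bend to follow the arc, so it systematically
# misses the curved part of the "true" relationship: that is its bias.
```

Because the underlying relationship here rises and then flattens, the fitted line slopes upward but underestimates heights in the middle of the range and overestimates them at the ends, no matter how the fit is tuned.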
The inability of a machine learning algorithm, such as Linear Regression, to capture the true relationship is called bias. Since the straight line cannot curve like the "true" relationship, we say that it has a relatively high amount of bias. Another machine learning method might fit a winding line to the training set. The winding line is very flexible and follows the training set along the arc of the true relationship. Because the winding line can follow the arc of the true relationship between weight and height, we say that it has very little bias. We can compare how well the straight line and the winding line fit the training set by calculating their sums of squares. In other words, we measure the distances from the fitted lines to the data, square them, and add them all up. Psst! The distances are squared so that negative distances don't cancel out positive ones. Note that the winding line fits the data so well that the distances between it and the data are all zero. In the contest to see whether the straight line or the winding line fits the training set better, the winding line wins! But remember: so far we have only calculated the sums of squares for the training set. We still have the test set! Now let's calculate the sums of squares for the test set. In the contest to see whether the straight line or the winding line fits the test set better, the straight line wins! Although the winding line does a great job fitting the training set, it does a terrible job fitting the test set. In Machine Learning terms, the difference in fit between data sets is called variance. The winding line has low bias, because it is very flexible and can adapt to the curve of the "true" relationship between weight and height. But the winding line has high variance, because it produces very different sums of squares for different data sets.
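The train/test sum-of-squares comparison above can be sketched in Python. Everything here is a stand-in: the data, the "true" curve, and the choice of a degree-7 polynomial to play the role of the winding line are assumptions for illustration, not the models from the video.

```python
import numpy as np

# Stand-in data: a hypothetical arcing "true" relationship plus noise.
rng = np.random.default_rng(1)
f = lambda w: 10 * (1 - np.exp(-0.5 * w))
w_train = np.linspace(1.0, 10.0, 8)
w_test = np.linspace(1.5, 9.5, 8)
h_train = f(w_train) + rng.normal(0, 0.4, w_train.size)
h_test = f(w_test) + rng.normal(0, 0.4, w_test.size)

def sum_of_squares(coeffs, w, h):
    """Distances from the fitted curve to the data, squared and summed."""
    return float(np.sum((np.polyval(coeffs, w) - h) ** 2))

straight = np.polyfit(w_train, h_train, deg=1)  # high bias, low variance
winding = np.polyfit(w_train, h_train, deg=7)   # low bias, high variance
# (degree 7 through 8 points passes through every training point exactly)

train_straight = sum_of_squares(straight, w_train, h_train)
train_winding = sum_of_squares(winding, w_train, h_train)   # essentially zero
test_straight = sum_of_squares(straight, w_test, h_test)
test_winding = sum_of_squares(winding, w_test, h_test)
# On the training set the winding line wins; on the test set the
# straight line typically wins, because the winding line chased the noise.
```

The key pattern is that the winding line's training sum of squares is (near) zero while its test sum of squares is not, and the gap between the two data sets is exactly what the transcript calls variance.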
In other words, it is hard to predict how well the winding line will perform on other data sets. It might work well sometimes, and terribly at other times. On the other hand, the straight line has relatively high bias, because it cannot capture the curve in the relationship between weight and height. But the straight line has relatively low variance, because its sums of squares are very similar for different data sets. In other words, the straight line might only give us good predictions, not great predictions, but they will be consistently good predictions. BAM!!! Oh no! [Terminology alert] Because the winding line fits the training set really well, but not the test set, we say that the winding line is overfit. In Machine Learning, the ideal algorithm has low bias, so it can accurately model the true relationship, and low variance, so it produces consistent predictions across different data sets. This is done by finding the sweet spot between a simple model and a complex model. Oh no! Another [Terminology alert]! Three commonly used methods for finding the sweet spot between simple and complicated models are Regularization, Boosting, and Bagging. The StatQuest on Random Forests shows an example of Bagging in action! And we will talk about Regularization and Boosting in future StatQuests! Double BAM!!! Hooray! We've made it to the end of another exciting StatQuest. If you liked this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, well, please consider buying one or two of my original songs. All right, see you next time! Quest on!!!
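To give a flavor of one of those three methods, here is a hedged NumPy sketch of regularization, specifically an L2 (ridge) penalty. The data, the polynomial degree, and the penalty strength are all assumptions chosen for illustration; the future StatQuest on regularization is the place for the full story.

```python
import numpy as np

# Stand-in data: the same kind of arcing "true" relationship plus noise.
rng = np.random.default_rng(2)
w = np.linspace(1.0, 10.0, 8)
h = 10 * (1 - np.exp(-0.5 * w)) + rng.normal(0, 0.4, w.size)

# Polynomial features up to degree 7 (the "winding line" family),
# with the non-intercept columns standardized so the penalty
# treats every column comparably.
X = np.vander(w, N=8, increasing=True)
X[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

def ridge_fit(X, y, lam):
    """Minimize squared error plus lam * ||coeffs||^2 (intercept unpenalized)."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # don't shrink the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

unregularized = ridge_fit(X, h, lam=0.0)   # the winding line: chases noise
regularized = ridge_fit(X, h, lam=10.0)    # shrunk toward a simpler shape

# Shrinking the wiggly high-degree coefficients trades a little extra
# bias for much lower variance, pulling the fit toward the sweet spot.
```

The penalty makes the coefficient norm strictly smaller as `lam` grows, which is exactly the simple-versus-complex trade the transcript describes: a slightly worse fit to the training set in exchange for predictions that are more consistent across data sets.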