Transcript:

Statcast [Music] Stat quest stat quest stat quest! Hello, I’m Josh Stormer. And welcome to stat quest today. We’re going to be talking about K-means clustering we’re gonna learn how to cluster samples that can be put on a line on an XY graph and even on a heat map and Lastly, we’ll also talk about how to pick the best value for K. Imagine you had some data that you could plot on a line and you knew you needed to put it into three clusters. Maybe they are measurements from three different types of tumors or other cell types. In this case, the data make three relatively obvious clusters, but rather than rely on our eye. Let’s see if we can get a computer to identify the same three clusters to do this. We’ll use k-means clustering. We’ll start with raw data that we haven’t yet clustered. Step one select the number of clusters. You want to identify in your data? This is the K in K-means clustering in this case, We’ll select K equals three. That is to say we want to identify three clusters. There is a fancier way to select a value for K. But we’ll talk about that later. Step two randomly select three distinct data points. These are the initial clusters step 3 measure the distance between the first point and the three initial clusters. This is the distance from the first point to the blue cluster. This is the distance from the first point to the Green Cluster, and this is the distance from the first point to the Orange Cluster. Well, it’s kind of yellow, but we’ll just call it orange. For now. Step 4 assign the first point to the nearest cluster in this case. The nearest cluster is the blue cluster. Now we do the same thing. For the next point, we measure the distances and then assign the point to the nearest cluster. Now we figure out which cluster the third point belongs to. We measure the distances and then assign the point to the nearest cluster. The rest of these points are closest to the Orange Cluster, So they’ll go in that one two now that all the points are in clusters, we go on to step 5 calculate the mean of each cluster, then we repeat what we just did measure and cluster using the mean values since the clustering did not change at all during the last iteration we’re done. Bam, the K-mean’s clustering is pretty terrible compared to what we did by eye. We can assess the quality of the clustering by adding up the variation within each cluster. Here’s the total variation within the clusters since k-mean’s clustering can’t see the best clustering, its only option is to keep track of these clusters and their total variance and do the whole thing over again with different starting points. So here we are again back. At the beginning, K-means clustering picks three initial clusters and then clusters, all the remaining points calculates the mean of each cluster and then re clusters. Based on the new means it repeats until the cluster is no longer change. Bit bit bit of bit of boop boop boop. Now that the data are clustered, we sum the variation within each cluster. And then we do it all again. At this point? K-mean’s clustering knows that the second clustering is the best clustering so far, but it doesn’t know if it’s the best overall, so it will do a few more clusters. It does as many as you tell it to do and then come back and return that one If it is still the best question, how do you figure out what value to use for? K with this data, it’s obvious that we should set K to three, but other times it is not so clear one way to decide is to just try different values for K. We’ll start with K equals 1 K equals 1 is the worst case scenario we can quantify its badness with the total variation. Now try! K equals 2 K equals 2 is better and we can quantify how much better by comparing the total variation within the two clusters to K equals 1 Now try K equals 3 K equals 3 is even better. We can quantify how much better by comparing the total variation within the three clusters to K equals 2 Now try K equals 4 The total variation within each cluster is less than when K equals 3 Each time we add a new cluster. The total variation within each cluster is smaller than before and when there is only one point per cluster, the variation equals 0 however, if we plot the reduction in variance per value for K, there is a huge reduction in variation with K equals three, but after that the variation doesn’t go down as quickly. This is called an elbow plot and you can pick K by finding the elbow in the plot question. How is K-mean’s clustering different from hierarchical clustering, k-means clustering specifically tries to put the data into the number of clusters? You tell it to hierarchical clustering. Just tells you pairwise what two things are most similar question. What if our data isn’t plotted on a number line just like before you pick three random points and we use the Euclidean distance in two dimensions. The Euclidean distance is the same thing as the Pythagorean theorem, then just like before we assign the point to the nearest cluster and just like before we then calculate the center of each cluster and re cluster BAM. Although this looks good, the computer doesn’t know that until it does the clustering a few more times question. What if my data is a heatmap? Well, if we just have two samples, we can rename them X and Y, and we can then plot the data in an XY graph. Then we can cluster just like before note, we don’t actually need to plot the data in order to cluster it. We just need to calculate the distances between things when we have two samples or two axes. The Euclidean distance is the square root of X squared plus y squared when we have three samples or three axes. The Euclidean distance is the square root of X squared, plus y squared plus Z squared, and when we have four samples or four axes, the Euclidean distance is the square root of x squared, plus y squared plus Z squared, plus a squared, etc, etc, etc. Hooray, we’ve made it to the end of another exciting stat quest. If you like this stat quest and want to see more, please subscribe. And if you want to support stack quest, we’ll click the like button down below and consider buying one or two of my original songs. Alright, tune in next time for another exciting stat quest.