Transcript:
StatQuest... exploring data, layer by layer... hooray! Hello, I'm Josh Starmer, and welcome to StatQuest. In this episode of StatQuest we're going to use SVD, singular value decomposition, to understand PCA, step by step. You'll learn what PCA does, how it works, and how to use it to get deeper insight into your data.

Let's start with a simple data set: we've measured the transcription of two genes, Gene 1 and Gene 2, in 6 mice. If you don't want to talk about mice and genes, think of the mice as individual samples and the genes as variables that we measure for each sample. For example, the samples could be high-school students and the variables could be math and reading test scores, or the samples could be companies and the variables could be market capitalization and number of employees. That said, we'll stick with mice and genes, because I'm a geneticist and I work in a genetics department.

If we only measured one gene, we could plot the data on a number line. Mice 1, 2 and 3 have relatively high values, and mice 4, 5 and 6 have relatively low values. Even though it's a simple graph, it shows that mice 1, 2 and 3 are more similar to each other than they are to mice 4, 5 and 6.

If we measured two genes, we could plot the data on a 2-dimensional x/y graph: Gene 1 is the x-axis and spans one of the two dimensions, and Gene 2 is the y-axis and spans the other dimension. We can see that mice 1, 2 and 3 cluster on the upper right, and mice 4, 5 and 6 cluster on the lower left.

If we measured three genes, we would add another axis and make the graph look 3-dimensional. The smaller points have larger values for Gene 3 and are farther away; the larger points have smaller Gene 3 values and are closer. If we measured 4 genes, we could no longer plot the data: 4 genes require 4 dimensions. 🙁

So we'll talk about how PCA can take 4 or more gene measurements, i.e. 4 or more dimensions of data, and make a 2-dimensional PCA plot. The plot will show us that similar mice cluster together. We'll also talk about how PCA can tell us which gene (or variable) is the most valuable for clustering the data.
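To make the setup concrete, here's a minimal Python sketch of a data set like the one described. The transcript doesn't give the actual measurements, so these numbers are made up to mimic the pattern: mice 1, 2 and 3 get high values and mice 4, 5 and 6 get low values, so the two clusters show up in a 2-D scatter plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical measurements for the 6 mice (not the video's actual numbers):
# mice 1-3 have relatively high values, mice 4-6 relatively low values.
gene1 = np.array([10.0, 11.0, 8.0, 3.0, 2.0, 1.0])
gene2 = np.array([ 6.0,  4.0, 5.0, 3.0, 2.8, 1.0])

plt.scatter(gene1, gene2)
for i in range(6):
    plt.annotate(f"mouse {i + 1}", (gene1[i], gene2[i]))
plt.xlabel("Gene 1")
plt.ylabel("Gene 2")
plt.show()
```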
For example, PCA might tell us that Gene 3 is responsible for separating the samples along the x-axis. Lastly, we'll talk about how PCA can tell us how accurate the 2-D graph is.

To understand what PCA does and how it works, let's go back to the data set with only two genes. We start by plotting the data. Then we calculate the average measurement for Gene 1 and the average measurement for Gene 2. With the averages, we can calculate the center of the data. From this point on, we'll focus on what happens in the graph; we no longer need the original data.

Now we shift the data so that the center is on top of the origin, (0, 0), in the graph. Note that shifting the data did not change how the data points are positioned relative to one another: this point is still the highest, this one is still the rightmost, and so on.

Now that the data are centered on the origin, we can try to fit a line to them. We start by drawing a random line through the origin, and then we rotate the line until it fits the data as well as it can while still going through the origin. Ultimately, this line fits best; but I'm getting ahead of myself. First we need to talk about how PCA decides how well a line fits, so let's go back to the original random line through the origin.

To quantify how well this line fits the data, PCA projects the data onto it. Then it can either measure the distances from the data points to the line and find the line that minimizes those distances, or it can find the line that maximizes the distances from the projected points to the origin. If it isn't obvious why these two approaches are equivalent, watch how the distances to the line shrink as the line fits better, while at the same time the distances from the projected points to the origin get larger.

To see the math behind this, consider a single data point. The point is fixed, so its distance from the origin is also fixed: it doesn't change no matter how the red dotted line rotates. When we project the point onto the line, we get a right angle between the black dotted line and the red dotted line. That means that if we label the sides a, b and c, we can use the Pythagorean theorem, a² = b² + c², to show that b and c are inversely related: since a (and thus a²) is fixed, when b gets bigger, c must get smaller, and likewise, when c gets bigger, b must get smaller. So PCA can either minimize the distance from the line or maximize the distance from the projected point to the origin.

I made a big deal out of describing both options because, intuitively, it makes sense to minimize b, the distance from the point to the line, but in practice c, the distance from the projected point to the origin, is easier to calculate. So PCA finds the best-fitting line by maximizing the sum of the squared distances from the projected points to the origin.

So for this line, PCA projects the data onto it and then measures the distance from this projected point to the origin; we'll call it d1 (note that I'm writing down each distance as we measure it). Then PCA measures the distance from the next projected point to the origin;
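The "rotate the line until it fits best" idea can be sketched directly as a brute-force search. This is not how PCA is computed in practice (SVD does it in one step, as we'll see), but it mirrors the description above: for each candidate line through the origin, project the centered points onto it and add up the squared distances from the projections to the origin. The data here are hypothetical centered values chosen to spread mostly along Gene 1.

```python
import numpy as np

# Hypothetical centered data (Gene 1, Gene 2) for the 6 mice.
data = np.array([[ 4.0,  1.0], [ 4.4,  1.2], [ 2.8,  0.6],
                 [-3.6, -0.8], [-4.2, -1.1], [-3.4, -0.9]])

best_ss, best_dir = -np.inf, None
for theta in np.linspace(0.0, np.pi, 1800):      # candidate lines through the origin
    direction = np.array([np.cos(theta), np.sin(theta)])
    c = data @ direction                          # distances from projected points to the origin
    ss = np.sum(c**2)                             # SS(distances) for this candidate line
    if ss > best_ss:
        best_ss, best_dir = ss, direction

print("best-fitting direction:", best_dir)        # close to [0.97, 0.24] for data like this
print("slope:", best_dir[1] / best_dir[0])        # roughly 0.25
```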
we call it d2, and then we measure d3, d4, d5 and d6. Those are all 6 distances. Next we square all of the distances; squaring means that negative values can't cancel out positive values. Then we add up all of the squared distances, giving the sum of squared distances; for short, we'll call it SS(distances).

Now we rotate the line, project the data onto it, and again sum up the squared distances from the projected points to the origin. We keep rotating and repeating until we end up with the line that has the largest sum of squared distances from the projected points to the origin. Ultimately we end up with this line: it has the largest SS(distances), and it's called Principal Component 1, or PC1 for short.

PC1 has a slope of 0.25. In other words, for every 4 units we go out along the Gene 1 axis, we go up 1 unit along the Gene 2 axis. That means the data are mostly spread out along the Gene 1 axis and only a little bit spread out along the Gene 2 axis.

One way to think about PC1 is as a cocktail recipe: to make PC1, mix 4 parts Gene 1 with 1 part Gene 2, pour over ice and serve! The ratio of Gene 1 to Gene 2 tells us that Gene 1 matters more when it comes to describing how the data are spread out.

Oh no! Terminology alert! Mathematicians call this cocktail recipe a linear combination of Genes 1 and 2. I mention this because when someone says that PC1 is a linear combination of variables, this is what they're talking about. No big deal.

The recipe for PC1, going over 4 and up 1, brings us to this point, and we can use the Pythagorean theorem to find the length of the red line: the good old a² = b² + c². Plugging in the numbers gives a = √(4² + 1²) ≈ 4.12, so the red line is 4.12 units long. When you do PCA with SVD, the recipe for PC1 is scaled so that this length equals 1. To scale the triangle so that the red line is 1 unit long, all we have to do is divide each side by 4.12. For those keeping track, here's the math worked out; it shows that all we need to do is divide all three sides by 4.12. These are the scaled values. The new values change our recipe, but the ratio stays the same: we still use 4 times as much Gene 1 as Gene 2.

So now, looking back at the data, we have the best-fitting line and the unit vector we just calculated. Oh no! Another terminology alert! This 1-unit-long vector, made up of 0.97 parts Gene 1 and 0.242 parts Gene 2, is called the singular vector, or eigenvector, of PC1, and the proportions of each gene are called loading scores. Also, PCA calls the SS(distances) for the best-fitting line the eigenvalue of PC1, and the square root of the eigenvalue of PC1 is called the singular value of PC1. BAM!!! That's a lot of terminology!!!

Now that we've got PC1 all figured out, let's work out PC2!!! Because this is only a 2-dimensional graph, PC2 is simply the line through the origin that is perpendicular to PC1; there's no more optimization to do. That means the recipe for PC2 is -1 part Gene 1 and 4 parts Gene 2, and if we scale that to get a unit vector, the recipe is -0.242 parts Gene 1 and 0.97 parts Gene 2. This is the singular vector, or eigenvector, of PC2, and these are the loading scores of PC2. In terms of how the values are projected onto PC2, the loading scores tell us that
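When the transcript says "when you use SVD to calculate PCA", the scaling to unit length happens automatically: the rows of the Vt matrix returned by numpy's SVD are the unit-length singular vectors, and their entries are the loading scores. A minimal sketch, reusing the hypothetical centered data from the earlier snippet:

```python
import numpy as np

# Hypothetical centered data (Gene 1, Gene 2) for the 6 mice.
data = np.array([[ 4.0,  1.0], [ 4.4,  1.2], [ 2.8,  0.6],
                 [-3.6, -0.8], [-4.2, -1.1], [-3.4, -0.9]])

U, s, Vt = np.linalg.svd(data, full_matrices=False)  # data = U @ np.diag(s) @ Vt

pc1 = Vt[0]         # unit-length singular vector (eigenvector) of PC1
print(pc1)          # entries are the loading scores, roughly [0.97, 0.24]
                    # (possibly times -1; the sign of a singular vector is arbitrary)
print(s[0])         # singular value of PC1
print(s[0] ** 2)    # eigenvalue of PC1 = SS(distances) for the best-fitting line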
Gene 2 is 4 times as influential as Gene 1. Lastly, the eigenvalue of PC2 is the sum of the squared distances between the projected points and the origin.

Now that we have PC1 and PC2, to draw the final PCA plot we just rotate everything so that PC1 is horizontal, and then we use the projected points to find where the samples go in the PCA plot. For example, these projected points correspond to sample 6, so sample 6 goes here. Sample 2 goes here, sample 1 goes here, and so on. Double BAM!!! That's how you do PCA with SVD.

Hooray! But before we dive into a slightly more complicated example, there's one last thing. Remember the eigenvalues, which we got by projecting the data onto the principal components, measuring the distances to the origin, squaring them and adding them up? We can convert them into variation around the origin by dividing by the sample size minus 1. For this example, imagine that the variation for PC1 = 15 and the variation for PC2 = 3. That means the total variation around both PCs is 15 + 3 = 18, so PC1 accounts for 15/18 = 0.83, or 83%, of the total variation around the PCs, and PC2 accounts for 3/18 = 17% of the total variation around the PCs.

Oh no! Another terminology alert!!!!! A scree plot is a graphical way to show the percentage of variation that each PC accounts for. We'll talk more about scree plots in a bit. BAM!!!

OK, let's quickly run through a slightly more complicated example: PCA with 3 variables (i.e. 3 genes). It works just like the 2-variable example. You center the data, and then you find the best-fitting line that goes through the origin; just like before, the best-fitting line is PC1. But now the recipe for PC1 has three ingredients, and in this example Gene 3 is the ingredient with the largest influence on PC1. Then you find PC2, the next-best-fitting line, which goes through the origin and is perpendicular to PC1. This is the recipe for PC2; here, Gene 1 is the most important ingredient. Lastly, we find PC3, the best-fitting line that goes through the origin and is perpendicular to both PC1 and PC2. If we had more genes, we'd just keep finding more and more principal components by adding perpendicular lines and rotating them.

In theory, there is one PC per gene (or variable), but in practice the number of PCs is either the number of variables or the number of samples, whichever is smaller. If that's confusing, don't worry about it; it's not super important, and I'll make a separate video on the topic next week.

Once you've got all the principal components figured out, you can use the eigenvalues, i.e. the sums of squared distances, to determine the proportion of variation that each PC accounts for. Here, PC1 accounts for 79% of the variation, PC2 accounts for 15%, and PC3 accounts for 6%. Here's the scree plot. PC1 and PC2 account for the vast majority of the variation, which means that a 2-D graph using just PC1 and PC2 would be a good approximation of this 3-D graph, since it would account for 94% of the variation in the data.

To convert the 3-D graph into a 2-D PCA graph, we strip away everything but the data and PC1 and PC2, then project the samples onto PC1 and onto PC2, and then rotate so that PC1 is horizontal and PC2 is vertical (that just makes it easier to look at). These projected points correspond to sample 4, and this is where sample 4 goes on our new PCA graph. Wait for it... Double BAM!!!
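The variation bookkeeping in the 2-gene example is a couple of lines of arithmetic. Here's a sketch using the example values from above (15 and 3), plus a bare-bones bar chart standing in for the scree plot. Note that since dividing each eigenvalue by n − 1 divides by the same constant, the percentages come out the same whether you use the eigenvalues or the variations.

```python
import numpy as np
import matplotlib.pyplot as plt

variation = np.array([15.0, 3.0])            # variation for PC1 and PC2 from the example
percent = 100 * variation / variation.sum()
print(percent)                               # [83.3..., 16.6...]  ->  83% and 17%

plt.bar(["PC1", "PC2"], percent)             # a minimal scree plot
plt.ylabel("Percent of total variation")
plt.show()
```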
To recap: we started with a messy, hard-to-read 3-D graph; then we calculated the principal components; then, using the eigenvalues of PC1 and PC2, we determined that a 2-D graph would still show us most of the variation in the data; and finally we projected the data onto PC1 and PC2 to draw the 2-D graph.

If we measured 4 genes per mouse, we couldn't draw a 4-dimensional graph of the data 🙁 but that doesn't stop us from doing the PCA math (which works whether or not we can draw the graph) and looking at the scree plot. Here it is: PC1 and PC2 account for 90% of the variation, so we can use just those two to draw a 2-D PCA graph. So we project the samples onto the first two PCs; these two projected points correspond to sample 2, so sample 2 goes here. BAM!

Note: if the scree plot looked like this instead, where PC3 and PC4 account for a substantial amount of the variation, then using just the first two PCs would not create a very accurate picture of the data. 🙁 However, even a noisy PCA graph like this one can still be used to identify clusters of data: these samples are still more similar to each other than they are to the other samples. Little BAM!!!

Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, please consider buying one or two of my original songs; the link to my Bandcamp page is in the lower right-hand corner and in the description below. OK, see you next time!
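Putting the whole recipe together for the 4-gene case: center the data, run SVD, check how much variation the first two PCs account for, and project the samples onto PC1 and PC2 to draw the 2-D PCA graph. A self-contained sketch with randomly generated stand-in data (the real measurements aren't in the transcript):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical measurements: 6 mice (rows) x 4 genes (columns).
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))

centered = data - data.mean(axis=0)       # center each gene on the origin
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

percent = 100 * s**2 / np.sum(s**2)       # percent of variation per PC (the scree plot values)
coords = centered @ Vt[:2].T              # project each mouse onto PC1 and PC2

plt.scatter(coords[:, 0], coords[:, 1])
for i, (x, y) in enumerate(coords, start=1):
    plt.annotate(f"mouse {i}", (x, y))
plt.xlabel(f"PC1 ({percent[0]:.0f}%)")
plt.ylabel(f"PC2 ({percent[1]:.0f}%)")
plt.show()
```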