Transcript:

That quest is off some. He comes out [Music]. Hello, I’m Josh Stormer. And welcome to stat quest today. We’re gonna be talking about how to do. PCA and Python. It’s gonna be clearly explained, however, before I get started. I need to have a big shout out for Chris Yoga, Zimsky. He wrote the code that I’m gonna be talking about and he’s. The reason why the stat quest exists to begin with here’s. What we’re going to talk about first? We’ll talk about how to make up some data that we can apply PCA to then we’ll talk about how to use the PC, A function from scikit-learn to do. PCA we’ll also talk about how to determine how much variation each principal component accounts for we’ll talk about how to draw a fancy PCA graph using Matplotlib and, Lastly, we’ll talk about how to examine the loading scores to determine what variables have the largest effect on the graph. The first thing we do is import. The pandas package pandas short for panel data makes it easy to manipulate data in python. Then we import the numpy package. Numpy will allow us to generate random numbers and do other mathy things. The random package will be useful for generating an example dataset. If you’re working with real data, you won’t need this package here. We are importing the PC A function from scikit-learn note, just like our there are a few Python PCA functions to choose from the one in Scikit-learn is the most commonly used that I know of the pre-processing package from Psych had learned gives US functions for scaling the data before performing PCA. Lastly, we’ll, import matplotlib so we can draw some fancy graphs. Python being a general-purpose programming language doesn’t have built-in support for tables of data, random number generation or graphing like R, So we import all the stuff we need. It’s no big deal. Now that we’ve imported all the packages and functions we need, let’s generate a sample data set. The first thing we do is generate an array of a hundred gene names. Since this is just an example data set. Our gene names are super boring. Gene, one, gene, two etc. Now we create arrays of sample names. We have five wild-type or WT samples and five knockout or Ko samples. If you don’t know what wild-type or knockout samples are, don’t worry about, it, just know that there’s two different types of samples. Note the range one comma. Six function generates values one two, three, four and five. In other words, it generates all integer values from 1 up to the value less than six. Now we create a panda’s data frame to store the made-up data. The stars unpack the WT and Ko arrays so that the column names are a single array that looks like this. Without the stars, we would create an array of two arrays and that wouldn’t create twelve columns like we want. The gene names are used for the index, which means that they are the equivalent of row names. This is where we finally create the random data for each gene in the index, ie. Gene, one gene, two dot dot. All the way to Gene 100 we create five values for the WT samples and five values for the KO samples. The made up data comes from two Poisson distributions, One for the WT samples and one for the KO samples. Note RNA sequencing aficionados will recognize that the Poisson distribution isn’t the right one to use for RNA. Seek data, no big deal. This is just a made-up dataset, so don’t worry about it for each gene. We select a new mean for the Poisson distribution. The means can vary between 10 and 1000 The head method for our data frame returns the first five rows of data. This is useful for verifying that the data look the way we expect they should. The shape attribute returns the dimensions of our data matrix. In this case, we get a hundred comma 10 100 genes by 10 total samples. Now that we’ve created a sample data set, let’s do PCA. Actually, before we do PCA, we have to Center and scale the data after centering the average value for each gene will be zero and after scaling the standard deviation for the values for each gene will be 1 notice that we are passing in the transpose of our data. The scale function expects the samples to be in rows instead of columns note. We use samples as columns in this example, because that is often how Genomic data is stored. If you have other data, you can store. It, however, is easiest for you. There’s no requirement that the samples be rows or columns. Just be aware that if it is columns, you’ll need to transpose it before analysis. One other note, this is just one way to use scikit-learn to Center and scale the data so that the means for each gene are 0 and the standard deviations for each gene are 1 Alternatively, we could have use standard scalar fit, underscore transform. The second method is more commonly used for machine learning and that’s what Scikit-learn was designed to do. That’s why it’s available one last note about scaling with scikit-learn versus using scale or per comp in our to do. The scaling in scikit-learn variation is calculated as the measurements minus the mean squared divided by the number of measurements. In contrast in our using scale or per comp variation is calculated as the measurements minus the mean squared divided by the number of measurements minus one, this method results in larger but unbiased estimates of the variation. The good news is that these differences do not affect the PCA analysis, The loading scores and the amount of variation per principal component will be the same either way. The bad news is that these differences will have a minor effect on the final graph. This is because the coordinates on the final graph come from multiplying the loading scores by the scaled values. Now we create a PCA object rather than just have a function that does PCA and returns results, Scikit-learn uses objects that can be trained using one data set and applied to another data set since were only using PCA to explore one data set and not using PCA in a machine. Learning setting the additional steps are a little tedious, but they set us up for the machine learning topics that I’ll be covering very soon. Then we call the fit method on the scale data. This is where we do all the PCA math, ie. We calculate the loading scores and the variation each principal component accounts for Lastly, this is where we generate coordinates for a PCA graph based on the loading scores in the scaled data. Now we’re ready to draw a graph. We’ll start with a scree plot to see how many principle components should go into the final plot. The first thing we do is calculate the percentage of variation that each principal component accounts for now we create labels for the scree plot. These are PC-1, PC2, etc, one label per principal component. Now we use matplotlib to create a bar plot and here’s what the scree plot looks like with all the fancy annotation that we added using the Y label. X label and title methods. Almost all of the variation is along the first principal component, so a 2d graph using PC1 & PC2 should do a good job representing the original data to draw a PCA plot. Well first put the new coordinates created by PCA transform of the scaled data into a nice matrix where the rows have sample labels and the columns have PC labels. These commands draw chatter plot with the title and nice access labels, and this loop adds sample names to the graph. Lastly, we display the graph there. It is wow, so awesome! Bam, the WT samples clustered on the left side, suggesting that they are correlated with each other. The KO samples clustered on the right side, suggesting that they are correlated with each other, and the separation of the two clusters along the X-axis suggests that the wild-type samples are very different from the knockout samples. Lastly, let’s look at the loading scores for PC One to determine which genes had the largest influence on separating the two clusters along the X axis, well start by creating a panda’s series object with the loading scores in principal component one note. The principal components are zero index, so PC one equals zero. Now we sort the loading scores based on their magnitude, aka. We take the absolute value here. We are just getting the names of the top ten indexes, which are the gene names. Lastly, we print out the top ten gene names and their corresponding loading scores. And this is what we get. These values are super similar, so a lot of genes played a role in separating the samples rather than just one or two double bam! Hooray, we’ve made it to the end of another exciting stat quest. If you like this stat quest and want to see more, please subscribe. And if you have any suggestions for future stat quests, we’ll put them in the comments below until next time quest on.