Transcript:

StatQuest! It’s coming at you. StatQuest! It’s gonna find you. StatQuest! Watch out!

Hello, and welcome to StatQuest. StatQuest is brought to you by the friendly folks in the genetics department at the University of North Carolina at Chapel Hill. Today we’re going to be talking about linear discriminant analysis, which, let’s be honest, sounds really fancy, and it kind of is, but not really. I think we can understand it. Let’s see what it does, and then we’ll work it out. That is, let’s look at some examples of why we might need linear discriminant analysis, and then we’ll talk about the details of how it works.

Imagine that we have a cancer drug, and that cancer drug works great for some people, but for other people it just makes them feel worse. Wah-wah. We want to figure out who to give the drug to: we want to give it to the people it’s going to help, but we don’t want to give it to the people it might harm. And since I’m a geneticist and I work in a genetics department, the way I answer all my questions is to look at gene expression. Maybe gene expression can help us decide.

Here’s an example using one gene to decide who gets the drug and who doesn’t. We’ve got a number line; on the left side we’ve got fewer transcripts, and on the right side we’ve got more transcripts. The dots represent individual people. The green dots are people who the drug works for; the red dots represent people whom the drug just makes feel worse. We can see that, for the most part, the drug works for people with low transcription of gene X, and, for the most part, the drug does not work for people with high transcription of gene X. In the middle we see that there’s overlap, and there is no obvious cutoff for who to give the drug to. In summary, gene X does an OK job at telling us who should take the drug and who shouldn’t. Can we do better? What if we used more than one gene to make a decision?
Here’s an example of using two genes to decide who gets the drug and who doesn’t. On the x-axis we have gene X, and on the y-axis we have gene Y. Now that we have two genes, we can draw a line that separates the two categories: the green, where the drug works, and the red, where the drug doesn’t work. We can see that using two genes does a better job separating the two categories than just using one gene; however, it’s not perfect. Would using three genes be even better? Here I’ve got an example where we’re trying to use three genes to decide who gets the drug and who doesn’t. Gene Z is on the z-axis, which represents depth, so imagine a line going through your computer screen and into the wall behind it. The big circles, or big samples, are the ones that are closer to you, and the smaller circles, the smaller samples, are the ones that are further away along the z-axis. When we have three dimensions, we use a plane to try to separate the two categories. Now, I’ll be honest: I drew this picture, but even for me it’s hard to tell if this plane separates the two categories correctly. It’s hard for us to visualize three dimensions on a flat computer screen. We need to be able to rotate the figure and look at it from different angles to really know, and that’s tedious. What if we need four or more genes to separate two categories? Well, the first problem is we can’t draw a four-dimensional graph, or a 10,000-dimensional graph. We just can’t draw it. That’s a bummer. Wah-wah. We ran into the same problem when we talked about principal component analysis, or PCA, and if you don’t know about principal component analysis, be sure to check out the StatQuest on that subject; it’s got a lot of likes, and it’s helped a lot of people understand how it works and what it does.
PCA, if you can remember, reduces dimensions by focusing on the genes with the most variation. This is incredibly useful when you’re plotting data with a lot of dimensions, or a lot of genes, onto a simple XY plot. However, in this case we’re not super interested in the genes with the most variation; instead, we’re interested in maximizing the separability between the two groups so that we can make the best decisions. Linear discriminant analysis, LDA, is like PCA: it reduces dimensions. However, it focuses on maximizing the separability among the categories. Let’s repeat that to emphasize the point: linear discriminant analysis, LDA, is like PCA, but it focuses on maximizing the separability among the known categories. Here we’re going to start with a super simple example: we’re just going to try to reduce a two-dimensional graph to a one-dimensional graph. That is to say, we want to take this two-dimensional graph, aka an XY graph, and reduce it to a one-dimensional graph, aka a number line, in a way that maximizes the separability of the two categories. What’s the best way to reduce the dimensions? Well, to answer that, let’s start by looking at a bad way and understanding what its flaws are. One bad option would be to ignore gene Y, and if we did that, we would just project the data down onto the x-axis. This is bad because it ignores the useful information that gene Y provides. Projecting the genes onto the y-axis, i.e., ignoring gene X, isn’t any better. LDA provides a better way. Here we’re going to try to reduce this two-dimensional graph to a one-dimensional graph using LDA. LDA uses the information from both genes to create a new axis, and it projects the data onto this new axis in a way that maximizes the separation of the two categories. So the general concept here is that LDA creates a new axis and projects the data onto that new axis in a way that maximizes the separation of the two categories. Now, let’s look at the nitty-gritty details and figure out how LDA does that.
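The idea above, using both genes to build one new axis and projecting the data onto it, can be sketched in Python with scikit-learn. This is a minimal sketch under made-up data: the gene values and group sizes below are invented for illustration, and scikit-learn is just one library with an LDA implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# 20 people the drug works for (low gene X and gene Y) and 20 it does not
works = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
fails = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(20, 2))
X = np.vstack([works, fails])          # columns: gene X, gene Y
y = np.array([0] * 20 + [1] * 20)      # 0 = drug works, 1 = drug fails

# LDA combines both genes into one new axis and projects the data onto it
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)

print(X_1d.shape)                      # (40, 1): each person is now a
                                       # point on a number line
```

Note that, unlike just dropping gene Y, the projection here uses the information from both genes at once.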
How does LDA create the new axis? The new axis is created according to two criteria that are considered simultaneously. The first criterion is that once the data is projected onto the new axis, we want to maximize the distance between the two means. Here we have a green mu, a Greek character representing the mean for the green category, and a red mu representing the mean for the red category. The second criterion is that we want to minimize the variation, which LDA calls scatter and represents with s², within each category. On the left side we see the scatter around the green dots; on the right side we see the scatter around the red dots. And this is how we consider those two criteria simultaneously: we take the ratio of the difference between the two means, squared, over the sum of the scatter:

(μ_green - μ_red)² / (s_green² + s_red²)

The numerator is squared because we don’t know whether the green mu is going to be larger than the red mu or the red mu is going to be larger than the green mu, and we don’t want that number to be negative; we want it to be positive, so whatever it is, negative or positive to begin with, we square it and it becomes a positive number. Now, ideally, the numerator would be very large, meaning there’d be a big difference, or a big distance, between the two means, and ideally the denominator would be very small, in that the scatter, the variation of the data around each mean in each category, would be small. Now, I know this isn’t a very complicated equation, but to make things simple later on in this discussion, let’s call the difference between the two means d, for distance, so we can replace the difference between the two means with d:

d² / (s_green² + s_red²)

Now I want to show you an example of why both the distance between the two means and the scatter are important. Here’s a new data set. We still just have two categories, green and red. In this case there’s a little bit of overlap on the y-axis, but lots of spread along the x-axis.
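The ratio described above can be computed directly for any candidate axis: project the points onto the axis, then divide the squared distance between the projected means by the sum of the scatter. Here is a minimal numpy sketch; the green and red points and the two candidate directions are made up for illustration.

```python
import numpy as np

green = np.array([[1.0, 1.2], [1.5, 1.0], [2.0, 1.8], [1.2, 1.5]])
red   = np.array([[4.0, 4.2], [4.5, 4.8], [5.0, 4.1], [4.2, 4.6]])

def lda_ratio(w, a, b):
    w = w / np.linalg.norm(w)        # direction of the candidate axis
    pa, pb = a @ w, b @ w            # project each point onto that axis
    d_squared = (pa.mean() - pb.mean()) ** 2        # d**2
    scatter = (((pa - pa.mean()) ** 2).sum()        # s**2 for green
               + ((pb - pb.mean()) ** 2).sum())     # s**2 for red
    return d_squared / scatter

# an axis pointing along the direction that separates the two groups
# scores much higher than one perpendicular to it
good = lda_ratio(np.array([1.0, 1.0]), green, red)
bad = lda_ratio(np.array([1.0, -1.0]), green, red)
print(good, bad)
```

LDA’s job is to find the direction w that makes this ratio as large as possible.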
If we only maximize the distance between the means, then we’ll get something like this, and the result is we’ll have a lot of overlap in the middle. This isn’t great separation. However, if we optimize both the distance between the means and the scatter, then we get nice separation. Here the means are a little closer to each other than they were in the graph on the top, but the scatter is much less, so if we optimize both criteria at the same time, we can get good separation. So what if we have more than two genes, that is to say, what if we have more than two dimensions? The good news is that the process is the same: we create a new axis that maximizes the distance between the means of the two categories while minimizing the scatter. So here’s an example of trying to do LDA with three genes. We’ve got that three-dimensional graph that I showed you earlier. Here we’ve created a new axis, and the data are projected onto the new axis. This new axis was chosen to maximize the distance between the means of the two categories while minimizing the scatter. What if we have three categories? In this case, two things change, but just barely. Here’s a plot that has two genes, but now we have three categories. The first difference between having three categories, as opposed to just two categories like we had before, is how we measure the distances among the means. Instead of just measuring the distance between the two means, we first find a point that is central to all of the data; then we measure the distances between the point that is central in each category and the main central point. Now we want to maximize the distance between each category and the central point while minimizing the scatter for each category, and here’s the equation that we want to optimize. This is the same equation as before, but now there are terms for the blue category. The second difference is that LDA creates two axes to separate the data.
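The three-category case above can be sketched with scikit-learn as well: with three categories, LDA can create up to two new axes (the number of categories minus one). The four pretend genes and three groups below are invented for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# three categories of 15 people each, measured on four pretend genes
a = rng.normal([0.0, 0.0, 0.0, 0.0], 0.5, size=(15, 4))
b = rng.normal([3.0, 3.0, 0.0, 0.0], 0.5, size=(15, 4))
c = rng.normal([0.0, 3.0, 3.0, 0.0], 0.5, size=(15, 4))
X = np.vstack([a, b, c])
y = np.repeat([0, 1, 2], 15)

# with 3 categories LDA can build at most 3 - 1 = 2 new axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)

print(X_2d.shape)   # (45, 2): every sample now lives on an LD1/LD2 plot
```

The same call works no matter how many genes go in; only the number of output axes is capped by the number of categories.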
This is because the three central points, one for each category, define a plane. Remember from high school: two points define a line, and three points define a plane. That is to say, we create new X and Y axes; however, these are now optimized to separate the categories. When we only use two genes, this is no big deal. The data started out on an XY plot, and plotting them on a new XY plot doesn’t change all that much. But what if we used data from 10,000 genes? That would mean we need 10,000 dimensions to draw the data. Suddenly, being able to create two axes that maximize the separation of the three categories is super cool. It’s way better than drawing a 10,000-dimension figure that we can’t even imagine what it would look like. Here’s an example using real data. I’m trying to separate three categories, and I’ve got 10,000 genes. Plotting the raw data would require 10,000 axes. We used LDA to reduce the number to two, and although the separation isn’t perfect, it is still easy to see three separate categories. Now let’s use that same data set to compare LDA to PCA. Here’s the LDA plot that we saw before, and now we’ve applied PCA to the exact same set of genes. PCA doesn’t separate the categories nearly as well. We can see lots of overlap between the black and the blue points; however, PCA wasn’t even trying to separate those categories. It was just looking for the genes with the most variation. So we’ve seen the differences between LDA and PCA, but now let’s talk about some of the similarities. The first similarity is that both methods rank the new axes that they create in order of importance. PC1, the first new axis that PCA creates, accounts for the most variation in the data; likewise, PC2, the second new axis, does the second-best job, and this goes on and on for however many axes are created from the data. LD1, the first new axis that LDA creates, accounts for the most variation between the categories; LD2, the second new axis, does the second-best job, etc., etc., etc. Also, both methods let you dig in and see which genes are driving the new axes. In PCA, this means looking at the loading scores. In LDA, one thing you can do is look and see which genes, or which variables, correlate with the new axes. So, in summary: LDA is like PCA, and both try to reduce dimensions. PCA does this by looking at the genes with the most variation; in contrast, LDA tries to maximize the separation of known categories. And that’s it! Tune in next time for another exciting StatQuest.
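The two similarities above, ranked axes and a way to see which variables drive them, can both be sketched with scikit-learn. The explained_variance_ratio_ attribute holds the ranking for PCA, and correlating each variable with LD1 is one simple way (an assumption here, not the only approach) to see what drives an LDA axis. The data are invented; only the first gene separates the two groups.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
# two groups; only the first "gene" differs between them
g1 = rng.normal([0.0, 0.0, 5.0], 0.5, size=(20, 3))
g2 = rng.normal([3.0, 0.0, 5.0], 0.5, size=(20, 3))
X = np.vstack([g1, g2])
y = np.repeat([0, 1], 20)

pca = PCA(n_components=2).fit(X)                            # ignores labels
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)  # uses labels

# both methods rank their new axes in order of importance
print(pca.explained_variance_ratio_)   # PC1's share >= PC2's share

# loading scores say which genes drive PC1 ...
print(np.round(pca.components_[0], 2))

# ... and correlating each gene with LD1 shows what drives the LDA axis
ld1 = lda.transform(X)[:, 0]
corr = [np.corrcoef(X[:, j], ld1)[0, 1] for j in range(X.shape[1])]
print(np.round(corr, 2))               # gene 0 should dominate
```

With two categories LDA has only one axis, LD1; with more categories the same correlation trick can be repeated for LD2 and beyond.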