Transcript:

Welcome to the lesson on LDA visualizations. In this video we're going to talk about a dimension reduction method that uses the theory behind linear discriminant analysis to create a projection into lower dimensions. So what are we going to talk about? The LDA projection, which is a projection into lower dimensions that maximizes the distance between class means while at the same time minimizing the within-class spread, so it's going to give us the best way to view separation between classes, or at least linear separation between classes, when we have class labels. So let's talk a little bit about what we mean by projection and what would happen if we project in different directions. Here I have class data: a red class and a green class in my data set, and I'm looking at a projection onto this line here. To project, you take each point in your data set and follow the line perpendicular to the line you're projecting onto, and that tells you where the point will land after the projection. So here's my original data, and after projection I have this data sitting on the line I've projected onto. Now, this first projection we picked takes our two classes and projects them in a direction that comes close to maximizing the variance of the total data set, but it squashes the classes on top of each other, so we don't see any class separation. That first plot is along the lines of what PCA does: its projection is in the direction of greatest variation of the total data. The plot on the right-hand side is what we're looking for. We take our two classes and project them in some other direction.
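To make the idea of projecting onto a line concrete, here is a small sketch in Python with NumPy; the point coordinates and the two projection directions are made up purely for illustration.

```python
import numpy as np

# Toy 2-D data for two classes (made-up coordinates, for illustration only)
red = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.1]])
green = np.array([[1.0, 3.0], [1.5, 3.2], [2.0, 3.1]])

def project(X, a):
    """Project each row x of X onto the unit vector a: x -> a^T x."""
    a = a / np.linalg.norm(a)
    return X @ a

# Projecting onto the horizontal axis squashes the classes together ...
red_h = project(red, np.array([1.0, 0.0]))
green_h = project(green, np.array([1.0, 0.0]))

# ... while projecting onto the vertical axis keeps them separated.
red_v = project(red, np.array([0.0, 1.0]))
green_v = project(green, np.array([0.0, 1.0]))
```

After the first projection the two classes occupy the same interval on the line; after the second, every red point lands below every green point.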
Same two data sets, but now we're projecting onto a different line, this line here, and we see the projection preserves the separation between the classes. So now we can view our data in one dimension and still see the separation between the classes. Of course, we care more about projecting data from a hundred dimensions down to two or three, where we can look at it, than we do about projecting things into one dimension; if the data is two-dimensional, we don't really need to project it. So here's another way to look at that projection. Suppose I have two classes again, red and green, and let's think about how to do a projection that maximizes the separation between my classes. The first thought you might have is to project in the direction that maximizes the separation between the means: there's the line connecting the means, and if I project onto that line, that's going to maximize the separation between the means after the projection. So that's what this would look like. This line a1 is parallel to the line between the means, and if I project onto it, I get what I see in the picture here: my classes are now overlapping. I've taken data in two dimensions, where the classes had some separation, and projected down to one dimension in a way that doesn't give good class separation. The distance between my means is preserved; it's the same as the original distance between the means, and if I project onto any other line, the means will be closer to each other than if I project onto this line, but I get class overlap. So what I really want to do in the linear discriminant projection is project in a direction that maximizes the separation between the means but also minimizes the spread of the classes. The way I projected these classes,
they have a lot of spread, or variance, after the projection. This new projection, a2, still keeps pretty good separation between the means, but the variance, or spread, of my classes is minimized. That's the trade-off we're looking for: a projection that maximizes the separation between the means and minimizes the within-class spread. That's what the linear discriminant analysis projection is going to give us, and a2 gives better separation because we accounted for the variance, the spread within the individual classes. One piece of notation before we go into how we compute this projection: the projection of a point x_j in the direction of a vector a (the vector a determines my line) is written a^T x_j, the vector a transposed and multiplied by x_j. Now we're going to get into some linear algebra, some formalism, some calculus and optimization. It's not really required that you be able to follow how we produce all these formulas, but it's nice to see where this stuff comes from, to see the machinery going on behind linear discriminant analysis. So we're going to give a summary-level understanding, walking through the details fairly quickly without taking too much time to get bogged down in them. Okay, so the goal is to find the projection that maximizes the distance between the means, normalized by some sort of variance, after the projection. To do this, we need to start with some formulas. The mean of class i, before projection, is this; that's just the familiar formula for the mean of class i, nothing new. Now we define the scatter of class i.
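As a quick check on the notation, here's a sketch (with toy data and variable names of my own choosing) of the class mean before projection and the projections a^T x_j of each point:

```python
import numpy as np

X_i = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])  # points x_j of class i (toy data)
mu_i = X_i.mean(axis=0)       # mean of class i before projection: (1/n_i) sum_j x_j

a = np.array([1.0, 1.0])
a = a / np.linalg.norm(a)     # the direction vector a determining the line
proj = X_i @ a                # the projections a^T x_j, one scalar per point
```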
The scatter of class i is defined to be this, and this formula has the same form we've seen for the covariance of a class: it's the covariance matrix except there's no dividing by the number of samples minus one out front. So the scatter is just the covariance times the number of samples minus one; we just don't divide by the number of samples minus one to get an actual covariance. Scatter is very similar to covariance. So what are we going to do? We compute the mean of each class after projection, so two more definitions, and these are fairly straightforward. This is the mean of class i after projection: we take the formula for the projection of point x_j, sum up all the projected points, and divide by the number of points, and that gives us the mean after projection. The little tilde on top of a letter means that's the value for that object after it's been projected. And here we have the scatter after projection: it's S_i with a tilde, where the tilde on top means we're looking at it after the projection, and it's just the formula for S_i with each of the terms inside projected. So these are the mean and scatter before projection and the mean and scatter after projection, and we need these formulas because we want to maximize the distance between the means after projection while at the same time minimizing the scatter after projection. So we want to maximize the spread between the means and minimize the sum of the scatters over all our classes after projection. Another way to say that is that we're going to find the vector a that maximizes the value of this: we want the distance between the means to be large and the total spread to be small, so we're maximizing the ratio of the distance between the means divided by the scatter. That's the goal for linear discriminant analysis: find the a that maximizes this.
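The relationship between the scatter matrix and the covariance matrix is easy to verify numerically; a minimal sketch with randomly generated toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # one class with n = 20 samples (toy data)

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc                  # scatter matrix: sum of (x - mu)(x - mu)^T

# Scatter is covariance times (n - 1); np.cov divides by n - 1 by default.
assert np.allclose(S, (len(X) - 1) * np.cov(X, rowvar=False))
```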
J of a, where J(a) is a single scalar output. To do this we have a little more computation, and it's fairly straightforward; these are just rearrangements. This is the mean after projection, and this is the formula for the mean after projection from the previous page. Now, you can pull a^T outside the sum, and that's what we have written here: a^T is outside the sum, and then the sum is just the mean of class i, mu_i, and so we have that. So this first computation says that the mean after projection is the projection of the mean. A very similar thing happens with the scatter: this is the scatter after projection, and you can follow through the series of linear algebra steps showing that it's equal to what you get when you take a^T times the scatter matrix times a. Using those two results, if we come down here and look at J(a), we're trying to find the value of a that maximizes it. J(a) is equal to this formula here, using the computations we just did, which, if we write it out a little more carefully, is this here; we're being a little informal, and by this expression at the top we really mean what's written here. Then we can take these a^T terms here and here, and a little bit of linear algebra says we can bring them all the way out to the outside. So there's a bunch of linear algebra we're walking through here, and to the degree that you've seen this type of linear algebra before, it should make some sense. If you haven't, it's good to walk through these steps and see some of the stuff that goes on under the hood; it gets you familiar with how these things are computed. So we want to maximize the value of J(a), and here's our formula for J(a). To understand it better, we're going to make some more definitions, starting with the within-class scatter, S_W.
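The two identities from this derivation, that the mean after projection equals a^T mu and the scatter after projection equals a^T S a, can be checked numerically; a sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))      # one class in 4 dimensions (toy data)
a = rng.normal(size=4)            # an arbitrary projection direction

mu = X.mean(axis=0)
S = (X - mu).T @ (X - mu)         # scatter before projection

proj = X @ a                      # projected points a^T x_j
mu_t = proj.mean()                # mean after projection (mu with a tilde)
S_t = ((proj - mu_t) ** 2).sum()  # scatter after projection (a scalar in 1-D)

# The mean after projection is the projection of the mean ...
assert np.isclose(mu_t, a @ mu)
# ... and the scatter after projection is a^T S a.
assert np.isclose(S_t, a @ S @ a)
```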
It's the sum of the scatters for the individual classes; the within-class scatter for the total data set is the sum of the scatters of the individual classes, and that's given by this formula here: a sum over all x of x minus mu, where this mu is the mean of whichever class x is contained in, and this formula is familiar from the covariance matrix formula. And then the between-class scatter, S_B (B for between-class), is the scatter formula applied to the means. This is the mean of class i, and this is the mean of all of the means; in this case we only have two means, so it's a bit simpler a situation. Okay, so what we really want to do is minimize the within-class scatter and maximize the between-class scatter. If you understand that, even if you didn't follow every detail of every formula, you understand that we want to maximize the between-class scatter while minimizing the within-class scatter; that's a good understanding of what the linear discriminant analysis projection does. And so here's J(a); this is the function we want to maximize, where we want to find the a that gives the maximum value. To find a maximum we use calculus: we take the derivative of J(a) and set it equal to zero. For the derivative of J(a), we just substitute in the formula for J(a), and then this equals sign uses the quotient rule from calculus. Now we multiply both sides by this piece in the bottom; of course, it cancels the denominator in the formula on the left-hand side, and multiplying zero by it just leaves a zero on the right-hand side. In the next step of our computation, we multiply both sides by this, and now this piece right here is equal to one and this piece right here is J(a). We use those two pieces of information to get this formula right here, and we move things around a little bit.
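Here's a sketch of computing the within-class scatter S_W and between-class scatter S_B for a two-class toy data set. Following the lecture's definition, S_B here centers the class means at the unweighted mean of the means; note that some texts weight each class by its size instead.

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(loc=[0.0, 0.0], size=(10, 2))   # class 1 (toy data)
X2 = rng.normal(loc=[3.0, 1.0], size=(10, 2))   # class 2 (toy data)

def scatter(X, center):
    """Scatter matrix of the rows of X about the given center."""
    Xc = X - center
    return Xc.T @ Xc

# Within-class scatter: sum of the individual class scatter matrices
S_W = scatter(X1, X1.mean(0)) + scatter(X2, X2.mean(0))

# Between-class scatter: the scatter formula applied to the class means,
# centered at the mean of the means
means = np.array([X1.mean(0), X2.mean(0)])
S_B = scatter(means, means.mean(0))
```

With only two classes, S_B is a rank-one matrix built from the single direction between the two means.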
We use the fact that J(a) is a scalar, and we get to this formula. Then we multiply both sides on the left by S_W inverse, a matrix inverse, and we get this formula here, and this is a generalized eigenvalue problem. It tells us what a has to be: a has to be an eigenvector of that matrix right there, and J(a) will be its eigenvalue. So this tells us the direction in which to project. If we have a two-class problem, we can solve for a directly, and there's the formula for the vector a: we take the vector between the means and multiply it on the left by the inverse of the within-class scatter matrix. More generally, if we have c classes, some arbitrary number, the LDA projection will be the projection onto the c minus 1 eigenvectors of this matrix. So remember, PCA was a projection onto the eigenvectors of the covariance matrix; LDA is the projection onto the eigenvectors of this matrix here, and in these two terms, this is the between-class scatter and this is the within-class scatter. Remember, we want to, in a sense, maximize the between-class scatter, the scatter between the class means, while at the same time minimizing the scatter, or variance, within the classes after the projection. So intuitively it should make sense that this is the thing we want to use to compute the LDA projection. Again, just to clarify, because there's a nice end result here: in PCA we project onto the eigenvectors of the covariance matrix; in LDA we project onto the eigenvectors of this scatter matrix right here. Okay, so let's go through some examples. Actually, you've already seen an LDA example in the Interactives, in the plot on the right-hand side here. The left-hand side shows the Iris data; remember, it has four variables, and this is a full all-pairs scatter plot on those four variables with the points colored by their class labels.
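For the two-class case, the closed form a = S_W^{-1}(mu_2 - mu_1) and the top eigenvector of S_W^{-1} S_B give the same direction, which we can sketch with toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal(loc=[0.0, 0.0], size=(30, 2))   # class 1 (toy data)
X2 = rng.normal(loc=[4.0, 1.0], size=(30, 2))   # class 2 (toy data)

def scatter(X, center):
    Xc = X - center
    return Xc.T @ Xc

S_W = scatter(X1, X1.mean(0)) + scatter(X2, X2.mean(0))
means = np.array([X1.mean(0), X2.mean(0)])
S_B = scatter(means, means.mean(0))

# Two-class closed form: a = S_W^{-1} (mu_2 - mu_1)
a = np.linalg.solve(S_W, X2.mean(0) - X1.mean(0))

# General form: a is the top eigenvector of S_W^{-1} S_B
vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
top = vecs[:, np.argmax(vals.real)].real
```

The two directions agree up to scale (and possibly sign), which is all that matters for a projection direction.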
And if you look over there, you can see some separation in the individual scatter plots on the left-hand side, but none of them shows the full class separation. Then we do an LDA projection and we get the plot on the right-hand side. This is a projection onto LDA component one and LDA component two, and you can see we've got class separation that's a little better than even the best of the individual pairs plots on the left-hand side. What we have is a two-dimensional slice through the four-dimensional space where the original data lies, and projecting onto that two-dimensional slice, shown on the right-hand side as the projection onto the LDA components, gives you the best view of the class separation. We've been using that in the Interactives to display the four-dimensional data in two dimensions in a way that lets us see some approximation of how the classes are being separated by the different colored regions. We've been using it because we want to visualize how the different algorithms are separating those classes, and we need to do that in two dimensions. So that's why we've been using the LDA projection, even though until now we haven't been able to talk about it. Here's another example. This is the handwritten digits data that we've looked at from time to time. I think this is nine classes, the digits 0 through 8; I don't think there are any nines in here. On the left-hand side, we've taken this data, which is originally in 256 dimensions, where each dimension corresponds to the brightness of a pixel in the digital image taken of the handwritten digit, and out of those 256 dimensions we see the data projected onto the first two principal components with PCA. This was done in Python, and here's the command in Python.
It's using a custom function for doing the plot, which is provided here, but this first line does the PCA projection in Python. On the right-hand side we have a linear discriminant analysis; again, the first line does the LDA projection, the second line makes the custom plot, and plt.show() displays it. On the right-hand side we see the LDA projection, and we see much better class separation from LDA than from PCA. This is nine classes, so to get the full benefit of the projection we would need eight dimensions; we've really projected down from 256 to eight to see the separation between our classes. But because we can only view a plot in two dimensions, we're only seeing two of those eight dimensions, and we still see some quite nice class separation. The fours tend to be up here, the zeros tend to be down here, the threes are here, the twos are there. Even though there's still quite a bit of overlap, you can make some estimation of where the classes are; there seems to be a lot of overlap between two classes in this region, and it's hard to separate them out by color, but you can still see, even with just two of the eight LDA dimensions, a lot of class separation. So, in review, what have we talked about? We've talked about the linear discriminant analysis projection. We went through the linear algebra and the calculus involved in computing things in a rather quick flyover, but what you should know is that it maximizes the between-class scatter, the separation between the means, while minimizing the within-class scatter, a measurement of the variance within the classes after the projection. That's the important thing to know about the LDA projection. Why do we do this?
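The Python workflow the video describes looks roughly like the following sketch, assuming scikit-learn is installed. It uses scikit-learn's bundled 8x8 digits data (64 dimensions, classes 0 through 9) rather than the 256-dimensional set shown in the video, and the video's custom plotting helper is omitted.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)   # 8x8 digit images flattened to 64 features

# Unsupervised: project onto the first two principal components
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised: project onto the first two LDA components (uses the labels y)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
```

Each transform returns an (n_samples, 2) array, which is what you would then hand to a scatter plot colored by the labels y.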
It gives a lower-dimensional projection that preserves as much of the linear class separation as possible. It can be used to view structure among the classes, or to visually see how much class separation there is. It can also be used for dimension reduction as a preprocessing step, similar to PCA. One final note: there are other methods for dimension reduction that use nonlinear methods for understanding the data and giving you a dimension reduction. We're not going to go into those, but it's worth being aware of them. PCA and LDA are the most common, where PCA preserves the variance of the data and LDA minimizes the within-class scatter while maximizing the between-class scatter, as we talked about. Well, thank you very much for watching.