Transcript:

Hey, everyone, this is Sabtashi here, and I welcome you back to our course on machine learning using Python. Today we are going to look at linear discriminant analysis, popularly known as LDA, and this is the hands-on session. We'll do the hands-on in the Kaggle environment, and it will be available as a public notebook, so you can change the code, run as many experiments as you want, and see the results for yourself.

Let us do a quick recap. What LDA does is create linear combinations of the original features or attributes in the data. So if you have original features like x1, x2 and x3, a linear component can be created like LD1, which is 0.5 times x1 plus 0.2 times x2 plus 0.3 times x3. How many such linear components can there be? If I have n classes, there will be at most n minus 1 such components. If I have 10 classes, there will be 9 linear components.

You might think this sounds very similar to principal component analysis, or PCA, because that also creates linear combinations of the original features. So what's the difference? PCA creates components that capture the maximum variance in the data, whereas LDA creates components such that the classes are well separated.

All right, let's start with some housekeeping: importing our regular libraries like NumPy, pandas and Matplotlib, a few things from sklearn (datasets, PCA, discriminant analysis), and also colors, so that we can visualize our graphs better. So let us run this.

Like every time, our session is organized around five questions. First, we'll pick the Iris data and learn how to fit LDA on it. Then we will try to understand how LDA and PCA differ, again in the context of the Iris data.
We will look at how LDA can also be used as a classifier, not only as a dimensionality reduction tool, on the Iris data set. Next we move on to MNIST, which is a much larger data set, and look at things like classification and visualization on MNIST. Finally, very briefly, we will try to see whether we can combine LDA and PCA. So let's get started.

Our first question is how we fit LDA on the Iris data. Let's start by loading the Iris data set, which is available in sklearn's datasets. Once we call load_iris, we get the data, or the independent variables, in iris.data and the class in iris.target. So let me run this. If you remember, X is now a 2D NumPy array with four elements in each row. These are your independent variables. And if you look at y, it has the values 0, 1 and 2, because there are three classes.

Now, how do I fit LDA on Iris? I simply call LinearDiscriminantAnalysis, which is already imported. I have used only one argument, n_components equal to 2. Even this is not required, because the number of classes is three, so it will automatically assume the number of components is two. Then we call lda.fit on X and y. X is your input, y is your output. Please remember that y is required, because this is a supervised method. Then we do a transform. Transform means we actually create these linear combinations out of the original features. So let me run this.

Now if you look at X, the elements of the first row are 5.1, 3.5, 1.4, 0.2, and so on. If you look at X_r2, each row now has only two elements. This is where the dimensionality reduction is happening: from four features, you are coming down to two.
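The fit and transform steps described above can be sketched roughly as follows. This is a minimal reconstruction, not the notebook's exact code: the variable names iris, lda and X_r2 are assumed from how they are spoken in the session.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load Iris: X holds the four measurements, y holds the class labels 0, 1, 2.
iris = load_iris()
X, y = iris.data, iris.target

# LDA is supervised, so fit needs both X and y.
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)

print(X.shape)     # rows x original features
print(X_r2.shape)  # rows x linear discriminant components
print(X[0])        # first row in the original feature space
print(X_r2[0])     # the same row expressed as (LD1, LD2)

# Leaving n_components unset gives the same number of components here,
# because three classes allow at most two discriminants.
X_default = LinearDiscriminantAnalysis().fit(X, y).transform(X)
print(X_default.shape)
```

Calling fit and transform separately mirrors the explanation in the session; fit_transform(X, y) would do both in one call.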
Now, how much of the class separability does each linear discriminant component explain? That is available in lda.explained_variance_ratio_, and like principal components, the first one explains the largest share, and the total comes to one hundred percent. Here the first linear discriminant explains about 99.12 percent of the between-class variability.

So this is how we can fit LDA. Let us do a scatter plot and see how well the classes are separated. If you look at this, in terms of these two linear discriminant components the classes are quite well separated. So this is how LDA can be applied on the Iris data.

Let's go to the next question, which is how LDA and PCA differ on the Iris data set. Let us start by fitting principal components. We are taking n_components as four. Four is the maximum because there are four features. With LDA, however, you cannot go beyond two components. LDA is already fitted, so we are not fitting it again.

First, we are going to visualize this using scatter plots. In one scatter plot we look at the principal components, in the other we look at the linear discriminant components, and the classes are marked in different colors. This will help us understand how separable these representations are. The top one is PCA. This one class is well separated in both. But among the other two classes, LDA is separating them better than PCA. As you know, LDA focuses more on class separability.

What we do now is create a data frame where we keep the first two linear discriminant components and the first two principal components as well. So let us take this.
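As a rough sketch of this comparison, the snippet below fits both models on Iris, prints the explained variance ratios, and builds the data frame with the first two components of each. The column names LD1, LD2, PC1, PC2 and class are assumptions for illustration, and the plots are left out.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target

# LDA: at most two components for three classes; needs the labels.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit(X, y).transform(X)

# PCA: up to four components for four features; ignores the labels.
pca = PCA(n_components=4)
X_pca = pca.fit(X).transform(X)

print("LDA explained variance ratio:", lda.explained_variance_ratio_)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# Keep the first two components of each method side by side.
df = pd.DataFrame({
    "LD1": X_lda[:, 0], "LD2": X_lda[:, 1],
    "PC1": X_pca[:, 0], "PC2": X_pca[:, 1],
    "class": y,
})
print(df.groupby("class")[["LD1", "PC1"]].mean())
```

A data frame in this shape can be passed to a box plot grouped by class, which is what the session does next.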
We are going to look at the box plot of these linear discriminant components first. This is the box plot, and you can see that since LD1 represents the maximum class separability, when we plot against LD1 the classes are well separated. Each box corresponds to one of the classes. Whereas if you look at LD2, the classes are quite mixed up. You can understand why: the second LD represents less than one percent of the class separability.

We can also compare the main linear discriminant component with the main principal component, again using box plots. So let us run this. If you look at the box plots, the first one is LD1, and again you can see that LD1 has created better separated classes than PC1. So this is where the difference between a principal component and a linear discriminant component lies.

Now let us look at how LDA can work as a classifier. We do our regular train test split, and then we do lda.fit on the training data. On the test data, just as with any other classifier, we use lda.predict, and then we simply calculate the classification accuracy score. If we run this, we get a classification accuracy of 97.36 percent, which is a good accuracy on the Iris data set. So LDA can be used not only as a dimensionality reduction tool but also as a classifier.

Now let's look at how we can use LDA on MNIST data. MNIST, if you remember, is quite a large data set, which has 10 classes and tens of thousands of rows. Each row has 784 features, because each row is actually a 28 by 28 matrix of pixels, and 28 times 28 comes to 784. If I look at one of the rows, the first row, it has 785 columns. The first one is the label, which is your target field.
The remaining 784 are your input variables. So let's create X_train and y_train. The first column goes to y_train, and the rest of the columns go to your independent variables.

At first, we are just going to fit our LDA. We have set the number of components to 9, again just for our interpretability. We cannot go beyond 9 components, as the number of classes is 10. So we create the LDA, and then we do a fit and a transform, so the components get created and then the original features get transformed as well. If we just do fit, the features don't get converted on their own. Now if I look at X_train_r2, the most important thing you will notice is that the array size is reduced drastically: from 784 features, you have come down to nine, which is around one percent of the original features. This is the kind of magic LDA does.

All right, let us look at the explained variance ratio. This is not like the Iris case, where the first component was explaining about 99 percent. Here the first component explains around 23.9 percent, then 20.1, then 17.84, and so on and so forth.

Next, we want to look at how this can be visualized. One of the uses of LDA is visualization. You have 784 features, so you cannot visualize them directly. If I take the first two linear discriminant components and run a scatter plot, maybe we can get some representation. However, these two represent only about 44 percent of the variability. Let's still run it. And just to establish the superiority of these linear discriminant components, let's also pick any two original features and plot them with the classes marked in different colors. How will these two scatter plots look against one another?
Let's run this. Essentially, what we are doing here is plotting the two scatter plots. These are your original features, and the different colored circles mean that the points belong to different digits. Even using just the two linear discriminant components, you can see that the classes are much less mixed up. You can see that zero is well separated. Maybe eight is well separated, and maybe six is well separated. However, you can also see some overlap between eight and nine. The plot of the original features, on the other hand, appears to be completely random.

Now let's see another visualization. Here we picked the first two linear discriminant components, which explain the most class separability. What if I instead use the last two linear discriminant components and compare them with the first two? How should this look? I would expect it to definitely be less random than picking original features. And yes, it is less random. However, it is much more mixed up than using the first two linear discriminant components. So this is where the first two linear discriminant components are far superior.

Now let's also look at the accuracy we can achieve on the MNIST data set. For that, we are going to use X_test and y_test, which come from a separate CSV file, so here we don't need to do the train test split. Let's just find the accuracy score. One interesting thing is that you can call lda.predict on the original feature space itself, so you do not have to call transform yourself before predicting. The prediction accuracy comes to 87.3 percent. Considering it is a 10-class problem, that is not too bad, especially considering that the discriminant space has only nine dimensions instead of the 784 original features.

Okay, this is a small example where we are trying to combine LDA and PCA.
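Before the combined example, the classifier usage described above can be sketched on Iris, which ships with scikit-learn, rather than on the MNIST CSV files. The test_size and random_state values are assumptions made so the sketch is reproducible. They are not taken from the notebook, so the accuracy may differ from the figures quoted in the session.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Assumed split settings; the notebook's own values are not stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training rows only.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# predict accepts rows in the original four-feature space,
# so no explicit transform call is needed first.
y_pred = lda.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("accuracy:", acc)
```

The same pattern applies to MNIST: fit on the training CSV, then call predict on the 784-column test rows.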
To combine LDA and PCA, we load Iris again and set up our X and y. We first fit LDA with the number of components equal to 2, and then we fit PCA with the number of components equal to 4. Now we are going to do a very interesting visualization. First, we look at the scatter plot of PCA, where we take the first two principal components. For LDA, we take the two linear discriminant components. And in the last one, we take one principal component and one linear discriminant component. So let us run this and see what these scatter plots look like.

If you look at the scatter plots, you will see that PCA is quite interesting, but it is not as well separated as LDA. If you look at the PCA plus LDA plot, you will see that the classes are much better separated. In PCA, only this one axis separates them, whereas with LDA and PCA combined, both axes separate them well. So this is something that can be a hint or a cue for your further work, where you can experiment by mixing and matching different transformation components.

That brings us to the end of our notebook. Thank you so much for watching, guys. If you have any questions, please post them. If you find it helpful, please like and subscribe. This will be very, very motivating for us. Thank you so much!
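For anyone who wants to try the mixing idea, here is a minimal sketch of the three two-dimensional views discussed in the last example. Which principal component and which discriminant go into the mixed view is not stated in the session, so PC1 and LD1 are an assumption. The axis_separation helper is also hypothetical: it is not from the notebook, and it stands in for the visual judgment by computing, per axis, the between-class sum of squares divided by the within-class sum of squares.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target

X_lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y).transform(X)
X_pca = PCA(n_components=4).fit(X).transform(X)

# Three 2D views; the mixed one pairs PC1 with LD1 (an assumed choice).
views = {
    "PCA (PC1, PC2)": X_pca[:, :2],
    "LDA (LD1, LD2)": X_lda,
    "mixed (PC1, LD1)": np.column_stack([X_pca[:, 0], X_lda[:, 0]]),
}

def axis_separation(column, labels):
    """Between-class over within-class sum of squares along one axis."""
    overall_mean = column.mean()
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        values = column[labels == c]
        between += len(values) * (values.mean() - overall_mean) ** 2
        within += ((values - values.mean()) ** 2).sum()
    return between / within

# Higher score means the classes are further apart along that axis.
for name, view in views.items():
    scores = [axis_separation(view[:, j], y) for j in range(2)]
    print(name, np.round(scores, 2))
```

Each array in views has two columns, so it can be passed directly to a scatter plot with the class labels as colors.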