Transcript:

Hello, everyone! This tutorial goes over PCA using Python. PCA, otherwise known as principal component analysis, is one of the most commonly used unsupervised learning algorithms. PCA is inherently a dimensionality reduction algorithm, and we’ll go over what that means throughout this tutorial.
First, we’re going to go over PCA for data visualization: basically, when you have data with more than two or three dimensions that you want to reduce to two or three dimensions so you can visualize it, and hopefully understand your data better. When you do any sort of machine learning, a lot of the time it’s nice to understand your data just in general.
The second part we’ll go over is PCA to speed up machine learning algorithms. In a previous tutorial, Logistic Regression using Python, one of the things we briefly went over was that just by changing one of the parameters to logistic regression, the solver (i.e., the optimization algorithm), we greatly sped up how long it took to fit our algorithm. In this tutorial we’re going to look at a much more common way of speeding up the fitting of our machine learning algorithm, by using PCA, and that’s the second part of this tutorial.
I should note that the code used in this tutorial will be linked down below, as well as the blog post that I’m going through throughout this video, so all you have to do is click on those links and you’ll have access to all the code. I’m using Anaconda in this tutorial. If you need help installing it, I have so many tutorials on this it’s not funny, and feel free to ask questions on that too.
Okay, so we’re going to use the Iris dataset to apply PCA for data visualization. The important thing to note is that first we’re loading the Iris dataset into a pandas DataFrame. And one thing you’ll notice is that this data is four-dimensional.
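As a rough illustration of the loading step just described, here is a minimal sketch. It assumes scikit-learn’s bundled copy of Iris (`load_iris`) rather than whatever file the notebook reads, and the column names are chosen to match the ones used in this tutorial.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled copy of Iris instead of downloading a file.
iris = load_iris()

# Column names follow the ones used in this tutorial.
feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']
df = pd.DataFrame(iris.data, columns=feature_names)

# Map the integer class codes (0, 1, 2) to species names for the target column.
df['target'] = [iris.target_names[code] for code in iris.target]

print(df.shape)   # rows x columns: four feature columns plus the target column
print(df.head())
```

The four feature columns are what makes the data four-dimensional; the target column is kept alongside them only for labeling.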
We have four features: sepal length, sepal width, petal length, and petal width. It’s really hard to visualize four-dimensional data, so we’re going to use PCA to reduce our four-dimensional data to two dimensions so that we can plot and understand our data. And, of course, we have our target, and the target is typically what people use in supervised learning algorithms: it’s what they’re trying to predict, basically.
One thing I really want you to get out of this tutorial is that to use PCA, you need to standardize your data. PCA is an unsupervised learning algorithm that is affected by scale (most algorithms are going to be affected by scale to some degree), so we’ll be using StandardScaler to standardize the dataset’s features, i.e., the sepal length, sepal width, petal length, and petal width, onto unit scale, which means a mean of zero and a variance of one. Just to emphasize this point even further, scikit-learn has a wonderful section that goes over the importance of standardizing your data. This can have a major impact on prediction accuracy if you ever do machine learning, as well as on PCA for data visualization.
I should note that the code that produced all of this is just a Jupyter notebook, and it’s right here; you’ll have the ability to download it and use it however you want. I just find the blog post a little bit easier to visualize, at least for tutorial purposes. Basically, what this code is doing is separating out the features, because you standardize the features, not the target. The target is just this column over here called target. Then we’re fitting and transforming our features onto unit scale, with a mean of 0 and a variance of 1. Okay, so now we do a PCA projection to two dimensions.
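The standardization step described above can be sketched roughly as follows. This assumes scikit-learn’s bundled Iris data in place of the notebook’s DataFrame columns; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for the DataFrame's feature columns.
X = load_iris().data  # shape (150, 4): the four features only
# The target is deliberately left out; only the features are standardized.

X_scaled = StandardScaler().fit_transform(X)

# Each column should now have a mean close to 0 and a variance close to 1.
print(np.round(X_scaled.mean(axis=0), 6))
print(np.round(X_scaled.var(axis=0), 6))
```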
What you have to do is import PCA and make an instance of the PCA class; we’re basically saying, I want a PCA and I want to keep two principal components. Then we fit and transform our features, and we get two-dimensional data, with two principal components, I should say. Basically, we have something that’s four-dimensional, and after we apply PCA we have two dimensions. This part over here is just combining these two principal components with the target again, so that we have our final DataFrame, which is basically the two principal components and our target.
The whole point of this exercise was to go from four-dimensional data to something that we could visualize, something we could plot, so that hopefully we can understand our data better. I have some matplotlib code here. If you have any questions about it, let me know; I’m happy to help. What you’re going to see in this plot over here is that Iris setosa is very different from Iris versicolor and Iris virginica. So we understood something a little bit more about our data.
One thing I really want you to learn about PCA is that we went from a four-dimensional space to a two-dimensional space. Any time you go from a higher-dimensional representation (as you can see here, we have four-dimensional data) down to a lower-dimensional one (after running all this code, we went down to two-dimensional data), there is some information that’s going to be lost. The way PCA accounts for this, or the way you can think about it, is by the explained variance ratio. These first two principal components account for 95.8 percent of the variance, or the information.
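A minimal sketch of the two-component projection and the explained variance ratio, again assuming scikit-learn’s bundled Iris data. Because that copy of the data may differ slightly from the file used in the video, the printed ratios may not match the quoted figures to the last decimal.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data, standardized as described earlier in the tutorial.
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Keep two principal components: four dimensions in, two dimensions out.
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_scaled)  # shape (150, 2)

ratios = pca.explained_variance_ratio_
print(ratios)        # share of the variance held by each component
print(ratios.sum())  # total share retained by the two components
```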
The first principal component accounts for 72.77 percent, the second for 23 percent, and the remaining two for the rest of the variance. One thing that’s really important to note is that if you try to go down to two principal components and you’re below 85 percent of the information, it may not be the best idea to treat that visualization as an accurate representation, because you lost a lot of your variance, or your information, when you went from four dimensions, or however many dimensions, down to two. But since we’re above roughly 85 percent, this is more than a valid way to visualize our data.
Next we’re going to go over PCA to speed up our machine learning algorithm. Just about the most common application of PCA that I know of is speeding up machine learning algorithms. The reason we’re not going to use the Iris dataset like we used up above is that it’s a very, very small dataset. It’s considered a toy dataset, which means it’s a dataset for just trying out algorithms. If we used PCA here and then a machine learning algorithm, we wouldn’t really see a difference in how long it took to fit our algorithm, because it’s already so fast on the Iris dataset; it’s so small. So instead we’re going to use the MNIST database of handwritten digits, otherwise known simply as MNIST. This has 784 features; Iris only had four, as you remember. It has 60,000 training examples and a test set of 10,000. What it means by 784 feature columns is that we have 28 by 28 images, and inside each of those columns there will just be pixel intensities.
Again, all this code will be provided down below; feel free to use it as your own. The first thing you have to do is get the dataset, so we’re going to use scikit-learn’s fetch_mldata to get the MNIST dataset.
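To make the 784 feature columns concrete, here is a small sketch of how one row of pixel intensities maps onto a 28 by 28 image. The row here is randomly generated as a hypothetical stand-in; it is not MNIST data.

```python
import numpy as np

# Hypothetical stand-in for one row of the dataset: 784 pixel intensities in 0..255.
rng = np.random.default_rng(0)
flat_image = rng.integers(0, 256, size=784)

# Each row of 784 columns is a flattened 28 x 28 image.
image = flat_image.reshape(28, 28)

print(image.shape)

# Flattening the image again recovers the original row of 784 values.
assert np.array_equal(image.reshape(-1), flat_image)
```

In recent scikit-learn releases, fetch_mldata is no longer available; fetch_openml('mnist_784') is the usual replacement for downloading the same 70,000 images.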
There are a bunch of different ways to get this dataset; this is just the way I chose. Inside this dataset, once we download it, you’ll see mnist.data: these are the images in the dataset. There are 70,000 total images that are 784-dimensional, or 28 by 28 images; by the way, 28 times 28 is 784. And mnist.target: these are just the labels corresponding to each one of the images above.
Sorry, this is the notebook. After we download the dataset, we’re going to do the typical splitting of our data into training and test sets. Typically you split your dataset into 80% training and 20% test. In this case, I chose 6/7 of the data to be training, because I wanted 60,000 training images and 10,000 test images. This is just scikit-learn’s train_test_split. One thing I forgot to show over here is the import of train_test_split; you’ll see over here that you have to import it from sklearn.model_selection.
Like before, we have to standardize our data. I mentioned this earlier in the tutorial, so again we use StandardScaler. One thing you’ll start noticing about a lot of machine learning algorithms is that you have some sort of process, or some sort of pipeline, where you first standardize your data and then import and apply PCA. So I import PCA, and then I make an instance of the model. The difference from before is in how I make the instance. Before, we had the number of components equal to two; let me just go up so I can show you. We had PCA with n_components equal to 2. We basically said right off the bat that we want to go down to two components. In this case, what I’m telling scikit-learn is: please choose the minimum number of components such that 95% of the variance is retained. Okay, so scikit-learn, in effect, has a curve of explained variance that it uses to find this out.
It finds the minimum number of components such that 95% of the variance is retained. If you want to find out how many components that is, you can look at pca.n_components_, and that amounts to 154 principal components. Again, we’re doing this to speed up our machine learning algorithm.
An important thing to note is that, just like with any sort of algorithm where you fit on the training set, you do the same thing with PCA: you fit PCA on your training set only, and then once you’ve fit PCA on the training set, you apply that transform to both the training set and the test set. From there, it’s just like using a normal algorithm from scikit-learn: you import the model you want to use, in this case from sklearn.linear_model import LogisticRegression.
Let me just do this inside the Jupyter notebook so you can see. Let me wait till my computer catches up. So I made an instance of the model, and I’m fitting PCA on just the training set. On your computer it’ll probably be faster, because right now I’m recording this video, so it’s slowing down my computer quite a bit. Then I applied my transform to both the training and test sets. From here, it’s just import whatever model you want to use. I chose logistic regression; other algorithms will actually be a lot better for this case, it’s just that logistic regression is a very common algorithm. I make an instance of my model, and I’m fitting logistic regression. When I’m fitting the model, it’s learning the relationship between the digits and the labels. This is really what we’re trying to speed up when we use PCA to speed up machine learning algorithms: because I went down from 784 components in the original data to, I think, 154 components for 95 percent of the variance, this is what’s supposed to be sped up by using PCA.
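The split, standardize, PCA, and fit sequence described above can be sketched end to end as follows. To keep it quick and free of downloads, this assumes scikit-learn’s small bundled digits dataset (8 by 8 images, 64 features) instead of MNIST, so the component count it prints will not be 154. The variable names, `random_state`, and `max_iter` value are illustrative choices, not taken from the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: small bundled digits dataset (64 features) instead of MNIST (784 features).
digits = load_digits()
train_img, test_img, train_lbl, test_lbl = train_test_split(
    digits.data, digits.target, test_size=1 / 7.0, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(train_img)
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

# Keep the minimum number of components that retains 95% of the variance.
pca = PCA(0.95).fit(train_img)
print(pca.n_components_)

# Apply the transform learned from the training set to both sets.
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)

# max_iter is raised here as an assumption, to give the solver room to converge.
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(train_img, train_lbl)
print(model.score(test_img, test_lbl))
```

Fitting the scaler and PCA on the training set alone keeps information from the test set out of the transform, which is why both are fit once and then only applied to the test images.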
Okay, so I just got done fitting my algorithm, and from here you can measure your model’s performance. This blog post goes over the steps a little bit more cleanly than just giving you Jupyter notebooks. If you like the format, let me know; if you don’t, let me know as well.
One thing I want to note is the timing of fitting logistic regression after PCA, for different numbers of principal components kept. If I kept 100 percent of the variance, basically as if I hadn’t really applied PCA, it took roughly 48.94 seconds on my MacBook, with an accuracy of 0.9158. From there, I kept various percentages of the variance. With 85 percent of the variance retained, that amounted to 59 principal components and it took 8.85 seconds to fit my model, and it really didn’t affect the accuracy in this case.
One more thing I want to note is that we can also do an inverse transform. We went from 784 components down to 59, or 784 dimensions down to 59 dimensions, and sped up an algorithm. For this entire tutorial up until now we’ve gone from higher-dimensional data to lower-dimensional data. PCA can also take the lower-dimensional data, i.e., the compressed representation of the data, back to an approximation of the original high-dimensional data. After I ran PCA keeping 95% of the variance, I was able to do an inverse transform to get back to an approximation of the original 784-dimensional data. If you want to see this code and how it works, I have a link to it over here. You can see it’s really just that, after fitting and transforming to go from high-dimensional data to lower-dimensional data, i.e., 154 components, you can also do the inverse transform to get back to an approximation of the original data. Okay, and then there’s the image.
The image I have at the top of this blog post is basically just what you get after applying PCA, going down to, let’s say, 59 components. For this rightmost image, I just did an inverse transform to get back to the 28 by 28 image.
Okay, so that’s really it for this tutorial. Please let me know if you have any questions; I’m happy to help. I should note that if you have a good question, I might just answer it on this blog post or in the comments down below the YouTube video. Please subscribe if you want more content, and leave feedback if you want. That’s it, bye.
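For reference, the inverse transform mentioned in the tutorial can be sketched as below. This is a simplified illustration, not the linked notebook: it assumes scikit-learn’s bundled 8 by 8 digits (64 features) in place of MNIST and works on raw pixel values without standardizing.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: bundled 8 x 8 digits (64 features) stand in for MNIST; raw pixels, no scaling.
X = load_digits().data

pca = PCA(0.95)
lower_dimensional = pca.fit_transform(X)                  # compressed representation
approximation = pca.inverse_transform(lower_dimensional)  # back to 64 features

print(X.shape, lower_dimensional.shape, approximation.shape)

# Fraction of the total variance lost by the round trip (at most about 5% here).
lost = ((X - approximation) ** 2).sum() / ((X - X.mean(axis=0)) ** 2).sum()
print(lost)

# Reshape one reconstructed row back into an image for plotting.
first_image = approximation[0].reshape(8, 8)
```

The reconstruction has the original number of features again, but it is only an approximation: whatever variance the dropped components carried is not recovered.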