Feature selection in machine learning. My name is Richard Kirschner with the Simplilearn team. That’s www.simplilearn.com. Get certified, get ahead. Today we’re going to look at what’s in it for you: the need for feature selection, what feature selection is, feature selection methods, and feature selection statistics.
So, the need for feature selection. To train a model, we collect huge quantities of data to help the machine learn better. Consider a table which contains information on old cars. The model decides which cars must be crushed for spare parts. And when we talk about huge quantities of data, they save everything, down to people’s favorite cat pictures. You can imagine there’s so much data out there. Even in a company, they’ll save all these little pieces of information about people and companies and corporations. You need some way to sort through it, because if you try to run your models on all of it, you’ll end up with very clunky models, and they might have issues which we’ll talk about later. In this case we’re talking about cars and crushing, and not all of this data will be useful to us. Some columns, or parts of the data, may not contribute much to our model and can be dropped.
You can see right here we have who the owner of the car was. In our data, a car will not be crushed based on its previous owner. So that’s kind of clear-cut. Why would I care who owned the car before, once it’s in the junkyard and we’re crushing cars? We’re not going to care about that, so here we have dropped the owner column, as it does not contribute to the model. Having too much unnecessary data can cause the model to be slow. The model may also learn from this irrelevant data and be inaccurate. So feature selection is the process of reducing the input variables to your model by using only relevant data and getting rid of the noise in the data.
Consider the database given below. A library wants to donate some old books to make space on their shelves for new books, and we want to train a model to automate this task. In this case, the color of the book does not matter, and keeping it can cause the model to learn to donate books based on color. We can remove it as a feature.
Using feature selection, we can optimize our model in several ways. Number one is to prevent learning from noise and overfitting. That’s actually the huge one, because we don’t want the model to give us the wrong prediction. That also means improved accuracy: we want it to be as close to the right answer as we can get. And we want to reduce the training time. In some of these models the growth is exponential, so each feature you add increases the training time that much more.
Now we talk about feature selection methods. We put together a nice little flow chart that shows the various methods used for feature selection. You have your basic feature selection, and then there is supervised and unsupervised. Under supervised, there’s the intrinsic method, the wrapper method and the filter method. Unsupervised feature selection refers to methods which do not need the output label class for feature selection. Unsupervised learning is really a growing market right now, and it’s the same for feature selection. Supervised feature selection refers to methods which use the output label class for feature selection. And if you remember, we have three of them: intrinsic, wrapper and filter. We’re going to start with the filter method. Remember, we know what the output is, so we’re going to be looking at that output to see how each feature does against it.
In the filter method, features are dropped based on their relation to the output, or how they correlate with the output. You can see here we have a set of features, selecting the best features, the learning algorithm, and then performance, so we want to find out which features correlate with the output. Consider the example of the book classifier. Here we dropped the color column based on simple deduction, and that sums it up in a nutshell: we want to filter out things that clearly do not go with what we’re looking for.
Next we look at the wrapper method. In the wrapper method, we split our data into subsets of features and train a model using each one. Based on the output of the model, we add and subtract features and train the model again. You can see here: we have a set of features, we generate a subset, we run it through the algorithm, and we see how each of those subsets of features performs. Consider the book data set. Using the wrapper method, we would use a subset of the features to train the machine and adjust the subset according to the output. Let’s say we take name and number of times read and run just those, and we look at the output. If we then ran it with all four inputs, we’d see quite a different variation in there, and we might say, you know what, condition of the book and color really don’t affect what we’re looking for. You can see here we’ve also run it on condition of the book and color. Depending on the output of the model, we choose our final set of features. These features will give us the best result for our model, and it might come out that name and number of times read are probably pretty important.
Then there’s the intrinsic method. This method combines the qualities of both the filter and wrapper methods to create the best subset. The model will train and check the accuracy of different subsets and select the best among them.
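Here is a minimal sketch of the wrapper loop just described: generate a subset, train on it, score it, keep the best. It is not the code from the video. The data is synthetic, the feature names are hypothetical, and a plain least-squares fit stands in for whatever learning algorithm you would actually wrap.

```python
from itertools import combinations

import numpy as np

# Synthetic stand-in data: four numeric features, only the first two drive the target.
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]  # hypothetical names
rng = np.random.default_rng(0)
n_rows = 200
X = rng.normal(size=(n_rows, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_rows)

n_train = 150  # the first 150 rows train the model, the rest score it


def holdout_error(columns):
    """Fit least squares on one feature subset and return its held-out mean squared error."""
    cols = list(columns)
    train = np.column_stack([X[:n_train, cols], np.ones(n_train)])
    test = np.column_stack([X[n_train:, cols], np.ones(n_rows - n_train)])
    coef, *_ = np.linalg.lstsq(train, y[:n_train], rcond=None)
    return float(np.mean((test @ coef - y[n_train:]) ** 2))


# Wrapper loop: generate each two-feature subset, train on it, and keep the best scorer.
scores = {subset: holdout_error(subset) for subset in combinations(range(4), 2)}
best_subset = min(scores, key=scores.get)
print([feature_names[i] for i in best_subset])
```

Trying every subset only works for a handful of features. With more columns you would add or remove one feature at a time instead of enumerating all combinations.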
So we’ve looked at a little overview. Some of the common feature selection algorithms, grouped by the method they belong to, are given below, and you’ll see they’re primarily under supervised. As I said, there aren’t a lot of unsupervised methods, and the ones that are usually used find a way to create a supervised connection between the data.
When we talk about supervised methods, we have our filter method, which uses things like Pearson’s coefficient, chi-squared and the ANOVA coefficient. In the wrapper method there’s recursive feature elimination. Remember, we’re choosing a subset and we want to go through and look at each one, so you’re doing a lot of loops, or recursive calculations, to see which features work best and which ones don’t have an impact on the output. There are also a lot of genetic algorithms that go with the wrapper method and how it evaluates subsets. With the intrinsic method, there are two main ones we’re looking at. One is lasso regularization. Lasso is basically your standard regression model with a penalty on the coefficients, so it’s finding out how the different features fit together and which ones add together to give the least amount of error. The other one used in the intrinsic method is the decision tree. It says: if this one produces this result and that one produces that result, yes or no, which way do we go?
Based on the input and the output variables, we can choose our feature selection model. If you have a numerical input and a numerical output, you can use Pearson’s correlation coefficient or Spearman’s rank coefficient to select which features you’re going to feed into that specific model. If you have a numerical input and a categorical output, we’re going to be looking more at the ANOVA correlation coefficient or Kendall’s rank coefficient.
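For the numerical-input, numerical-output case just mentioned, a filter step can be sketched with the pandas `corr` method. This is only an illustration on synthetic data: the column names and the 0.3 cutoff are assumptions, not values from the video.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature that drives the target and one that is pure noise.
rng = np.random.default_rng(42)
n_rows = 500
df = pd.DataFrame({
    "useful": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
df["target"] = 2.0 * df["useful"] + rng.normal(scale=0.5, size=n_rows)

# Correlation of every feature with the target, by two different statistics.
pearson = df.corr(method="pearson")["target"].drop("target")
spearman = df.corr(method="spearman")["target"].drop("target")

# Filter step: keep features whose absolute correlation clears an arbitrary cutoff.
threshold = 0.3  # hypothetical cutoff, tune it for your own data
selected = pearson[pearson.abs() > threshold].index.tolist()
print(pearson.round(2).to_dict(), selected)
```

Pearson measures a linear relationship, while Spearman works on ranks, so Spearman is the safer choice when the relationship is monotonic but not a straight line.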
And if you have a categorical input and a numerical output, we might again be looking at the ANOVA correlation coefficient or Kendall’s rank coefficient. For categorical to categorical, we might be looking at the chi-squared test, contingency tables and mutual information. So based on the input and the output variables, we can choose our feature selection model.
Let’s go take a look in the Python code and see what we’re talking about here. For my IDE I’m going to use the Jupyter notebook, and I always launch it out of Anaconda. We’ll go up here and create a new Python 3 notebook, and we’ll call it feature select. Since we’re in Python, we’re going to be working mainly with NumPy, pandas and the matplotlib library. So we have our number array, we have the pandas data frame, which goes with the NumPy array, and then we want to graph everything, so we’re going to import these three modules.
Then we put together some data, and we’re going to read it in. It’s Kobe Bryant. I guess he’s a basketball player. Some of our team in the back have a liking for basketball, they know who Kobe Bryant is, and they want to learn a little bit more about what’s going on with his game. So we’re going to take a look at him. Once we import the data, we can see what columns are available. We print the original features count, so we can see how many features there are, then the list of them, and then just the data head, the top five rows. When we do this, we can see from the CSV file that we have 25 original features. Our original features are action type, combined shot type, game event ID and so forth.
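The loading step described here might look roughly like the sketch below. The video’s CSV file is not available, so a tiny in-memory CSV with made-up rows stands in for it, and the column spellings (`action_type`, `combined_shot_type` and so on) are assumptions based on the names read out above.

```python
import io

import pandas as pd

# Stand-in for the real file: a few made-up rows with a handful of the columns.
# With the actual data you would pass the file path to pd.read_csv instead.
csv_text = """action_type,combined_shot_type,game_event_id,loc_x,loc_y,shot_made_flag
Jump Shot,Jump Shot,10,167,72,0
Jump Shot,Jump Shot,12,-157,0,1
Driving Dunk Shot,Dunk,35,0,0,1
Layup Shot,Layup,43,-12,8,0
Jump Shot,Jump Shot,155,-145,-11,1
Jump Shot,Jump Shot,244,1,28,0
"""
data = pd.read_csv(io.StringIO(csv_text))

# Same three steps as in the walkthrough: count the features, list them, peek at the top rows.
original_features = data.columns.tolist()
print("Original features count:", len(original_features))
print("Original features:", original_features)
print(data.head())  # head() returns the first five rows by default
```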
There are a lot of features in here that are recorded on all of his shots. This is what we talk about, a massive amount of data. People are sitting there recording all this stuff, and they import it for different reasons. But depending on what we want to look at, do we really want all those features? Maybe the question we’re going to ask is: what’s the chance of him making any one specific shot? Right from the beginning, we can look at some of these and say, team name: I don’t know, maybe it does matter, because the other team might be really good at defense. Game date: maybe we don’t really want to look at the game date. Team ID: definitely not of importance in any of this. So we have 25 features, and some of them just really don’t matter to us.
We also have location x, location y, latitude and longitude. I’m guessing that’s the same data, or very similar data, imported twice. Maybe they’re zoned slightly differently, but as far as our program goes, we don’t want to repeat data. With some models, and this is true for most models, feeding in repeated data creates a huge bias: they weigh that data over other data. So just at a glance, these are the things we’re looking at. We want to find out how to get these features down and get rid of this bias and all the extraneous features that we don’t want to spend time running our models on and programming on.
As I pointed out, there’s location x, location y, latitude and longitude. Let’s take a look at those and see what we’re looking at. We’ll create a plot of these: a scatter plot of location x and location y, and then a scatter plot of data lon and data lat, which is probably longitude and latitude. Let’s also put a little title on each one.
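A sketch of the two side-by-side scatter plots follows. The data is synthetic: `lon` and `lat` are built here as rescaled copies of `loc_x` and `loc_y` with the vertical axis flipped, which is an assumption made to mimic redundant columns. The column names and the offsets are also assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic shot locations in court coordinates.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "loc_x": rng.integers(-250, 250, size=300),
    "loc_y": rng.integers(-50, 400, size=300),
})
# Assumed relationship: lon/lat are linear rescalings of loc_x/loc_y (arbitrary offsets).
data["lon"] = -118.0 + data["loc_x"] / 1000.0
data["lat"] = 34.0 - data["loc_y"] / 1000.0

fig = plt.figure(figsize=(10, 5))
plt.subplot(121)  # one row, two columns, first plot
plt.scatter(data["loc_x"], data["loc_y"], s=5, alpha=0.5)
plt.title("loc_x and loc_y")
plt.subplot(122)  # one row, two columns, second plot
plt.scatter(data["lon"], data["lat"], s=5, alpha=0.5)
plt.title("lon and lat")

# A numeric check to go with the picture: a correlation of +1 or -1 means the pair is redundant.
corr_x = data["loc_x"].corr(data["lon"])
corr_y = data["loc_y"].corr(data["lat"])
print(round(corr_x, 3), round(corr_y, 3))
```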
We’ll just go ahead and plot these. When you look at them, the two graphs are pretty much identical, except they’re flipped. So the locations they’re shooting from are probably the same, and at this point we can say, okay, we can get rid of one of these sets of data. We don’t need both x and y and latitude and longitude, because it’s the same data coming in.
As we look at this particular data, we might also ask: does it really make a difference which side of the court you’re on, the left side or the right side? So instead of looking at this as x and y, we might look at it as a distance and an angle, and we can easily compute that. You can see we create our data distance as the square root of location x squared plus location y squared. That’s standard Euclidean geometry, triangle geometry: the hypotenuse squared equals the sum of each side squared. Once we’ve done that, we can also compute the angle. The data angle is based on the arctangent, and where location x is zero we set it to pi over 2 to get our angle.
We’ll go ahead and run that, and you’ll see some errors come up. That’s because when we took slices over here, we took a slice of a slice. There are ways to fix that, but it’s really not important for this example. If you do see that, you want to look at replacing the indexing on the location x zero mask with, I believe the term is, .loc or .iloc in pandas. So there are different things in there, but for this it doesn’t really matter. These are just warnings, that’s all they are.
And then let’s combine our minutes remaining and seconds remaining columns into one. There’s another one. If you remember, up here we’re trying to get rid of columns. Do we really need both?
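The distance and angle computation can be sketched as follows, on a few made-up coordinates. The column names `loc_x`, `loc_y`, `dist` and `angle` are assumptions, as is taking the arctangent of y over x. Assigning through `.loc` with a boolean mask, rather than indexing a slice of a slice, is one way to avoid the warnings mentioned above.

```python
import numpy as np
import pandas as pd

# A few made-up shot coordinates, including rows where loc_x is zero.
data = pd.DataFrame({"loc_x": [0, 3, -3, 0, 4], "loc_y": [5, 4, 4, 0, -3]})

# Distance from the origin: square root of x squared plus y squared.
data["dist"] = np.sqrt(data["loc_x"] ** 2 + data["loc_y"] ** 2)

# Angle: arctan(y / x) where x is non-zero, and pi/2 where x is zero (avoids dividing by zero).
loc_x_zero = data["loc_x"] == 0
data["angle"] = 0.0
data.loc[~loc_x_zero, "angle"] = np.arctan(
    data.loc[~loc_x_zero, "loc_y"] / data.loc[~loc_x_zero, "loc_x"]
)
data.loc[loc_x_zero, "angle"] = np.pi / 2

print(data)
```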
Let me see if I can find it on here. There we go, there’s our minutes remaining. And then what was it? It was minutes remaining and a seconds column, so there’s also a seconds column on here. Let me see if I can find that one. Here’s our seconds remaining, and here’s our minutes remaining. This gets crazy when you’re looking at hundreds of these features. Say I’m going to write a model that predicts, in this case, how good his shots are as the data comes in, and I want it to run on your phone. If I’m running it across hundreds of features, it’s going to just hang up your phone. Whereas if I can get it down to just a handful, we’ll actually be able to run it on a smaller device and not use up as much memory or processing power.
So we’ll take data remaining time and set it to data minutes remaining times 60 plus data seconds remaining. We’re just combining those. Then we’ll reprint our data so we can see what we’ve got coming across. We have our action type, combined shot type, and so on. Oops, I got so zoomed in. Let me zoom out just a little bit. There we go. You can see that we now have our distance, our angle, and remaining time, which is now just a number that combines the minutes and seconds.
And yes, we’ve been adding columns. I thought you said we’re supposed to subtract columns, right? We’re going to delete the obsolete columns when we get to them. This is the filter method: we’re just filtering out the things that we really don’t need. Next, let’s explore team ID and team name. Let me just go ahead and run that.
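As a rough sketch of these two steps, here is the remaining-time column and a check for columns that hold a single value. The rows are made up, the `team_id` value is a placeholder, and the column names are assumptions based on the names read out above.

```python
import pandas as pd

# Made-up rows; team_id is a placeholder value, not the real identifier.
data = pd.DataFrame({
    "minutes_remaining": [10, 0, 5],
    "seconds_remaining": [27, 45, 0],
    "team_id": [1, 1, 1],
    "team_name": ["Los Angeles Lakers"] * 3,
})

# Combine minutes and seconds into one number of seconds.
data["remaining_time"] = data["minutes_remaining"] * 60 + data["seconds_remaining"]

# A column with exactly one distinct value carries no information for a model.
constant_columns = [col for col in data.columns if data[col].nunique() == 1]
print(data["remaining_time"].tolist(), constant_columns)
```

Scanning every column with `nunique` is handy when there are too many features to inspect by eye, which is the situation described above.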
If you look at this, we have Los Angeles Lakers, and then we have the team ID, and each has a single unique value. That’s not really anything that’s going to help, because this particular athlete plays for that team, so it’s the same on every line. So there’s another thing we can filter out: team ID and team name are just useless. Each column contains only one value.
Let’s take a look at matchup and opponent. That’s an interesting one. We see here that we have LAL versus POR and the opponent is POR, and likewise IND. Again, there’s a lot of duplicate information, so these basically contain the same information. We’re filtering all this stuff out, and this is because we’re only looking at one athlete. It might change if you’re looking at multiple athletes.
Now, these are easy to see, but we might have something that looks more like this: the distance, which we computed, and the shot distance. Are they the same thing? What we can do is plot them against each other, and we see it just draws a nice straight line. So again we’re looking at the same information, and we really don’t want to be running our model on repeated information.
Now let’s look at shot zone area, shot zone basic and shot zone range. We’re looking at the zones, and what does that mean? We’ll do this in a scatter plot as well. In this case we’re going to create three of them side by side, so we set our plot figure size to 20 by 10, and then we define our scatter plot by category feature function, which gives each category a slightly different color. Our shot zone area goes in plt subplot 131.
One three one is how that’s read, by the way, meaning it’s one row, we have three across, and this is the first one. So: one, three, one. Our first scatter plot by category is the shot zone area. We plot that, then the shot zone basic, and then the shot zone range, and you just push them through our function, so each of those goes through. You’ll see one three one, one three two, one three three. Again, it’s a one by three setup, and the last digit is the place of each one.
When we look at this, we can see that these shots map out the same, so again it’s very redundant information. That should be intuitive. Looking at it in these color graphs helps, because it’s very intuitive. Some of the stuff you’ll be looking for in data you might not understand, and you’ll see these patterns where they match or mostly match. You start to realize when you’re looking at these that they’re repetitive data, and then you want to explore them more closely, depending on what domain you’re working in. So we look at these, and they look just like the regions of the court, but we’ve already stored this information in the angle and distance columns. We’ve seen this image before: if we go back up here, here’s the similar image, and it’s repeated down here.
So now let’s drop all the useless columns. We can drop shot ID, team ID, team name, shot zone area, shot zone range, shot zone basic, the matchup, the longitude and latitude, because we’re putting that into distance; seconds remaining and minutes remaining, because we combined those into one column; shot distance, because we have just distance on there; location x, location y, game event ID and game ID.
These columns are just being dropped, and we’ll loop through our drops. This is a nice way of doing it, because as you’re playing with this kind of data, putting your list in one place helps. You’re just running it through an iteration, and you can come back and change it. You might be playing with different models, and you might be looking at all kinds of different things that you can drop and add back in as you test them out. Again, we’re working in the filter method, so this is a lot of human interaction with the data, and it takes a lot of critical thinking to look at this stuff and say what matches and what doesn’t.
So we look at the remaining features. Let me go ahead and run this: the original count against the new count. We had 25 features, and now we have 11. You can see that right there, we’ve circled it: there’s our 25, and now we’re down to 11. So we’ve cut it down to less than half, and you can see the actual remaining information on here, including the remaining time. At this point we’ve filtered it, and then we’d move into the next process, which would be to run our model on this. Maybe we would drop some of the features and see if it runs better or worse. That would be the next step, as opposed to the filter setup, and it would be one of the other methods, depending on which algorithm you use.
So that wraps up our demo on filter feature selection, going through and seeing how these different features are being repeated in this particular basketball setup. Like I said, there are a lot of other methods, but the filter one is really good because that’s where you usually start. You want to have your own visual and try to understand how the data works.
Thank you for joining us for Simplilearn. Get certified, get ahead. For more information, please visit www.simplilearn.com.
You can also post a question in the YouTube comments below and we will reply to it. If you want a copy of the data we used for this, you can request that from there as well.