Transcript:
What is going on, everybody? Welcome back to my YouTube channel, Richard On Data. If this is your first time here, my name is Richard, and this is the channel where we talk about all things data, data science, statistics, and programming, so subscribe for all kinds of content just like this if you haven't already, and hit the notification bell so YouTube notifies you whenever I upload a video. This is going to be a very special edition of my R tutorial series: we're kicking off a new series all about the caret package. For those of you totally unfamiliar with it, caret stands for Classification And REgression Training. The package was developed by Max Kuhn of RStudio, formerly a data scientist at Pfizer, and the problem it solves is that you used to have all these disparate packages: one for pre-processing, one for each method you want to use, one for summarizing the results. It was just too disconnected and all over the place, and the idea with caret is that it's a one-stop shop for all of your machine learning needs.

The reason I'm making a whole series out of this rather than just one tutorial is that this package is absolutely massive; there's way too much for me to possibly cover in one simple tutorial. Throughout the series we're going to cover, generally: prepping your data set for machine learning purposes; visualizing the feature distributions by class (because we're tackling classification rather than regression here); pre-processing the data set; removing low-information features algorithmically, if you choose to do that; visualizing feature importance; the various metrics of performance, things like sensitivity, specificity, and positive predictive value, which tell us how well our classifier is doing; hyperparameter tuning; using non-standard sampling methods like down- or up-sampling to correct for class imbalance issues; altering the thresholds for our classifier boundaries; and lastly, training and resampling multiple models.

Before we get started, just a few links to point out, general places this tutorial series draws some influence from. Number one, the GitHub documentation from Max Kuhn is absolutely phenomenal; it's very comprehensive documentation, so I highly recommend getting yourself acquainted with it. I'm also going to provide a link to an absolutely wonderful tutorial from author Selva Prabhakaran. I actually got a lot of the idea for the structure of this tutorial from that one; the first time I ever went through it must have been a couple of years ago now, but I learned a lot from it. I'm working with a totally different sort of problem here, as well as adding a few different things in and outside of the caret package, but a lot of the influence and the overall structure came from that tutorial, so a huge amount of credit goes out to it, and I definitely recommend checking it out. Lastly, there's a tutorial I'm going to link on how you deal with some of these class imbalance types of problems. Just a few more things before we get started: number one, smash the like button for the YouTube algorithm. I'll have a link in the description to my Patreon account; if you guys would be willing to support me over there, that would be awesome. And the script will be provided on my GitHub repo, also in the description of the video.
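Roughly, the setup for following along might look like this; a minimal sketch, where the package list is an assumption based on the functions referenced later in this video:

```r
# Minimal setup sketch for this series (package list is an assumption).
# install.packages(c("caret", "forcats"))  # uncomment on first run
library(caret)    # preProcess, featurePlot, nearZeroVar, createDataPartition, dummyVars
library(forcats)  # fct_collapse, for condensing sparse factor levels later on
```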
For this tutorial series, we're going to make use of the German Credit data set. I really like this data set for tutorial purposes because it illustrates some of the real-life problems you're going to run into whenever you deal with machine learning. We have one response variable here, a categorical variable, class, with two values, good or bad, referring to whether the subject in question is a good or a bad credit risk, and it's a little bit imbalanced in the sense that good shows up 70 percent of the time and bad shows up 30 percent of the time. In the real world, that's actually not necessarily that bad; I'm personally used to dealing with 90-to-10 imbalances, or even 95-to-5, so when I see 70-to-30 I'm honestly a little relieved. But regardless, we're going to see some of the challenges that it can cause throughout this tutorial. Now, this data set has a lot of variables, and I'm not even going to bring every single one of them in, but most of these variables are categorical, so I'm going to bring those in. The next step isn't actually a feature of this data set, but I'm going to do it here: for three percent of the age variable and, excuse me, for seven percent of the employment duration variable, I'm just going to make the values missing. Missing data is a fact of life that you are going to have to deal with in real-world machine learning problems, and I want to show you later how we're going to deal with it. All I'm going to do is randomly sample some of the rows for the age variable and some of the rows for the employment duration variable, and then turn those values into NAs, just to simulate some missing data. And as a last step, we're actually going to turn everything back to numerics at the very end, but for now I'm going to make some of these variables into factors. It's always a good step in any machine learning problem to get a summary of your data set and get a feel for what you're working with. What we're going to see here, for one thing, is this class variable; again, this is the 70-to-30-percent response variable I was talking about. You also want to look for variables that don't have a ton of variation to them, or that have some levels that occur very infrequently; for instance, this number-of-existing-credits variable is a red flag to me. We're going to deal with some of these problems algorithmically, but it's always a good idea up front to explore your data set a little and get an idea of what you're working with.
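A minimal sketch of the data setup just described, assuming the copy of the German Credit data that ships with caret; the exact columns kept, the seed, and the use of Duration as the stand-in for the employment duration variable are assumptions, not the video's exact script:

```r
data(GermanCredit)          # ships with caret; Class is the Good/Bad response
credit <- GermanCredit

set.seed(123)               # any seed works; this just keeps the simulated missingness reproducible

# Simulate missing data: roughly 3% of Age and 7% of the duration variable set to NA
age_rows <- sample(nrow(credit), round(0.03 * nrow(credit)))
dur_rows <- sample(nrow(credit), round(0.07 * nrow(credit)))
credit$Age[age_rows]      <- NA
credit$Duration[dur_rows] <- NA

# Get a feel for what we're working with
summary(credit$Class)       # roughly 70% Good / 30% Bad
summary(credit)
```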
One of the weird things to point out about this data set is that the factor variables are not all coded in the same way. Take, for instance, the variables telephone and foreign worker: these are coded as just zeros and ones, so telephone is either zero or one, and likewise for foreign worker. But then we have these variables housing.rent, housing.own, and housing.for.free; rent, own, and for-free are obviously different values of one variable, housing, and you can verify this yourself: if you look at any row, exactly one of housing.rent, housing.own, or housing.for.free is going to be a one. These are just different ways of encoding factor variables, and it helps to have these things in a consistent format; we're going to do that later. But first, in the spirit of continuing our exploration of the data set, we're going to make use of this wonderful featurePlot function. For featurePlot we need to pass in something to x, something to y, and then optionally something for plot. If you look at the documentation: for x we need a data frame, for y we need our response variable, and for plot (for classification purposes) you can select box, strip, density, pairs, or ellipse. Let's see a couple of different examples. Starting with box plots, we can see what the distribution of employment duration looks like for the bad and for the good credit classes; maybe a tiny little bit of difference there. For age, it's not as clear. Alternatively, we can do a density plot instead; I'm going to do this for the same two variables, and you can see the different density plots here. Again, maybe a little bit of difference for employment duration. Sometimes one type of visualization lets you see things that you can't see in another. Now let's take a look at the property variable, which again is coded across really four different columns; we see that for the unknown level, the two classes look probably the most different, whereas for the others, not really so much. Another super helpful function from the caret package is this nearZeroVar function, which obviously stands for near-zero variance. As always, the help documentation is your friend; it breaks down exactly how the function works. It returns the indexes of predictors that either have one single unique value (that is, they literally have zero variance), or that have two characteristics: they have very few unique values relative to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value is large. So there are two key arguments we want to pass here: freqCut and uniqueCut. freqCut is the cutoff for the ratio of the most common value to the second most common value, and uniqueCut is the cutoff for the percentage of distinct values out of the total number of samples, that is, rows or observations. Now, this is not a perfect function, and you're going to see why. The defaults for these arguments are freqCut = 95/5 and uniqueCut = 10. I took a more extreme case here, and it returned more variables, but there are definitely some limitations. In the more extreme case more variables get returned, but a number of these are actually values of the housing and property variables; remember from before, these were coded so that each level had its own column, and when your data set is set up like that, you're going to get some weird results, as you see here.
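Sketched out, the exploration steps above might look like this, again under the same column-name assumptions (Duration, Age, and the Property.* dummy columns); the non-default freqCut and uniqueCut values are illustrative, not necessarily the exact ones used in the video:

```r
# Box plots and density plots of a couple of numeric features, split by Class
featurePlot(x = credit[, c("Duration", "Age")], y = credit$Class, plot = "box")
featurePlot(x = credit[, c("Duration", "Age")], y = credit$Class, plot = "density")

# Density plots of the one-hot Property columns by Class
featurePlot(x = credit[, grep("^Property", names(credit))], y = credit$Class, plot = "density")

# Flag low-information predictors
nearZeroVar(credit, freqCut = 95/5, uniqueCut = 10, names = TRUE)   # the defaults
nearZeroVar(credit, freqCut = 80/20, uniqueCut = 20, names = TRUE)  # less strict cuts flag more columns
```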
And actually, that variable I pointed out before, number of existing credits (it's a little truncated in the window here, but the level 4 only appears six times), does not get returned by the nearZeroVar function. So again, this function is useful, but it has limitations; don't rely on it completely. What I'm going to do here, just for handling the lack of variance in some variables, is take the recommendation from the nearZeroVar function; both times it was complaining about the foreign worker variable for lack of variance. And I'm going to use the fct_collapse function from the forcats package to condense levels two, three, and four into a level two-plus. Okay, the next thing we're going to do in this machine learning pipeline is partition the data into a training set and a test set. To do this, we're going to use this wonderful createDataPartition function from caret. Why is this so helpful compared to just using the sample function? This function allows us to partition based on the proportions of the response variable. Remember how we have that 70/30 split in our response variable, where 70 percent is the good class and 30 percent is the bad class? When we use this function, it allows us to retain those proportions, so in both the training and the test sets we're still going to have that 70/30 split. What I'm going to do here is put 70 percent of the data in the training set and 30 percent in the test set. You always need a good, healthy amount of data in order to train the model, but you also need enough left over to provide some kind of test set, so that we know how well our classifier performs. A good rule of thumb is about 30 percent, maybe 33 percent, for the test set; I've typically seen anywhere from 20 percent to, on the high end, 40 percent, but 70/30 is a pretty good split. When we use the createDataPartition function, it returns the row indices, so it's super easy to subset the rows of our initial data set to create the training set, stick a minus right at the beginning of that subset to create the test set, and then we're good to go: we've got two data sets. Just as a final diagnostic, we can run a summary on the training set to check for red flags. You can see right up front that the class variable did, in fact, retain that 70/30 split. As far as the amount variable is concerned, it's important to point out that you do have some outliers, but we can handle that in our pre-processing pipeline. I don't really see a whole lot of other red flags, so we're ready for the other pre-processing steps.
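Here is a sketch of the level-collapsing and the stratified split; note that fct_collapse() comes from the forcats package, and applying it to the number-of-existing-credits column is my reading of the video, so treat that column choice as an assumption:

```r
# Condense the sparse upper levels of the number-of-existing-credits variable into "2+"
credit$NumberExistingCredits <- fct_collapse(
  factor(credit$NumberExistingCredits),
  "2+" = c("2", "3", "4")
)

# Stratified 70/30 split that preserves the Good/Bad proportions of Class
set.seed(123)
train_idx   <- createDataPartition(credit$Class, p = 0.7, list = FALSE)
trainingSet <- credit[train_idx, ]
testSet     <- credit[-train_idx, ]

summary(trainingSet$Class)   # should still be roughly 70% Good / 30% Bad
summary(trainingSet)         # final diagnostic; note the outliers in Amount
```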
All right, the last thing that I'm going to leave you with in this video is this preProcess function, and this is where we really start to get into the meat and potatoes of machine learning, because once again, before we can even think about training models, we need to make sure our data are processed and in a really solid format for both our training set and our test set. There are multiple transformations we want to make to both of them. Specifically, we have missing values, and we need to impute them. We don't have a consistent format for our factor variables or our dummy variables, and we want one: one-hot encoding, which is what you saw for the property and housing variables, where you had a different column for each of the levels. And lastly, we want to normalize all the variables so that they range between 0 and 1. We're going to do all of that just by using this preProcess function, as well as the dummyVars function, which works in a very similar way. The key argument to the preProcess function is the method argument. You'll see the default is centering and scaling, which just converts variables into z-scores, but if you look at the argument description you'll see that possible values for method include things like knnImpute, bagImpute, or medianImpute. This is how we're going to impute missing data; these are just different algorithmic approaches we could take to missing data imputation. I'm going to use the bagging approach, that is, the bagImpute method. Now, if we create this bagMissing object by running the preProcess function, passing in the training set with method bagImpute, it just returns the model. What we need to do to actually transform our data is use the predict function: the first argument is the model we just created in the last line, and then we have newdata. Honestly, newdata is a little bit of a misnomer, because the new data here is really the old data; pass in the training set, store the result back in the training set, and bam, our data are transformed. We're going to do something very similar in the next chunk: we specify a formula to this function called dummyVars, because, just like I said before, we want a consistent format for our categorical variables, what's known as one-hot encoding. Same sort of thing as before: we create this dummyModel object and then call predict with dummyModel as the first argument and newdata, which again is really the old data, the training set. This works a little differently from the last chunk in that what gets output is a matrix, and the response variable actually gets dropped. That's why I'm calling this trainingSetX: what gets output is basically a matrix of all of your predictors. I wrapped this inside the as.data.frame function just to turn that matrix into a data frame, so now I have a data frame of all the predictors. I'll have to go back at the end and add the response variable back in, but not before I normalize all the predictors. This range value that I'm going to pass to the method argument is how we normalize; by normalize I mean converting all the values to range between zero and one. In a situation where we have all continuous variables I like to just standardize, but here we're going to normalize rather than standardize; that is, a zero-to-one range instead of z-scores. So we go through a very similar procedure to the one above, and in this last chunk of code I'm going to add that class variable back in and make sure the name is appropriate; I'm going to call it class right at the end.
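Putting the whole pre-processing chunk together, here is a sketch under the same assumptions as before; the object names mirror the ones mentioned in the video but are illustrative:

```r
# 1) Build a bagged-tree imputation model on the training set, then apply it
#    (bagImpute needs the ipred package installed)
bagMissing  <- preProcess(trainingSet, method = "bagImpute")
trainingSet <- predict(bagMissing, newdata = trainingSet)

# 2) One-hot encode the factors into a consistent dummy-variable format;
#    the response (Class) is dropped from the output, hence "trainingSetX"
dummyModel   <- dummyVars(Class ~ ., data = trainingSet)
trainingSetX <- as.data.frame(predict(dummyModel, newdata = trainingSet))

# 3) Normalize every predictor to the 0-1 range
rangeModel   <- preProcess(trainingSetX, method = "range")
trainingSetX <- predict(rangeModel, newdata = trainingSetX)

# 4) Bind the response back on and name it Class
trainingSetFinal <- cbind(trainingSetX, Class = trainingSet$Class)
```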
So once again, I'm binding that class variable from the original training set onto this trainingSetX object, which is the transformed set of predictors, so the response variable and the predictors are all together in one data frame. Bam, we're good to go. That's all for today. In part two we're going to apply the same procedures we applied to the training set to the test set; we're also going to see how you can algorithmically identify features that aren't really adding a whole lot from a predictive standpoint; and then we're going to start training some models. So I'll see you all then. For now, make sure you smash the like button on the video, and I'll see you all in the not-so-distant future. Until then, Richard On Data.