Transcript:
Welcome back, everybody, to Cradle to Grave R. My name is Mark Gingras. Today we’re going to talk about how to split a data set randomly into a training set and a test set. We’re going to build a model off the training set and test it using predictions. We’ll do all of this using the tidyverse and the caret package, and we’ll use the mtcars data set. I’ll give you some nuances along the way about why some of these things are important to know, so let’s jump right in, because that’s the best way to do it.

Let’s load some libraries here. tidyverse is the first library, so we can use all the functions of the tidy method. If you don’t have it, click on Install down here, type in tidyverse, and install it. Same thing with the caret package, which we’re going to load next: library(caret), c-a-r-e-t. You’re going to see this package in multiple languages besides R; I first learned about the caret package in Python, when I was studying that. So load the caret package, and again, if you don’t have it, install it real quick.

The first thing we’re going to do is load the data. We’ll create a "load the data" section, simply call it my data, as always, and assign mtcars to it. Now, this is a very small data set: 32 observations with 11 variables. Typically, when you run any of these models on small data sets, you’re not going to get really good results, so please explore this later with larger data sets. Find a larger data set, switch out mtcars for something different, and of course switch out your parameter of interest, the parameter you want to make a prediction on, as well.

First, let’s inspect the data. What we’ll use is sample_n, which is part of dplyr, part of the tidyverse. So we’ll do sample_n, sample underscore n, which is just saying: hey, give me a random sample of this data.
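The setup and inspection steps described so far might look like the sketch below. The object name `my.data` is an assumed spelling of the spoken "my data", and the sample size of four is the one used in the walkthrough.

```r
library(tidyverse)  # provides dplyr::sample_n() and the %>% pipe
library(caret)      # used later for createDataPartition() and RMSE()

# Load the data; `my.data` is an assumed spelling of "my data".
my.data <- mtcars

dim(my.data)  # 32 rows, 11 columns

# Inspect four randomly chosen rows rather than the first or last few.
inspected <- my.data %>% sample_n(4)
inspected
```

In recent dplyr versions, `slice_sample(n = 4)` is the suggested replacement for `sample_n(4)`; both return a random subset of rows.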
And how many do I want? Let’s say I want four. Hit Command+Enter or Ctrl+Enter on that, and you’ll see down at the bottom that I have four random rows. That’s the beauty of sample_n: it just gives you a random set of rows to inspect. You don’t want to inspect only the first three or the last three or the middle three; you want random ones, because, you know, maybe some of these numbers are outrageous. So now we have the miles per gallon, we have the displacement, et cetera. All right, we’ve sampled it, and it looks good. You should also view your data, and you should also do a scatter plot of the features you’re looking at, et cetera.

So the first thing we’re going to want to do is split the data into a training set and a test set. To do that, we’re going to set a seed. set.seed will help us reproduce this random assignment, and I’ll show you how this makes a difference in just a few minutes. Setting the seed tells the computer: hey, I want random numbers, but I want to be able to reproduce them so that I can troubleshoot. You can’t troubleshoot something if you can never get the same thing twice. We’ll get the same random numbers every time as long as we continue to use set.seed(123), and I picked 123 completely out of my head.

So let’s create training.samples. We’re going to create the samples using mtcars miles per gallon, mtcars$mpg. You can do this multiple ways, but we’ll just use miles per gallon. Command or Ctrl+Shift+M will give you the pipe operator, so I’m going to pipe the miles per gallon into a function called createDataPartition. If I erase that real quick and start typing createDataPartition, you can see the caret package in the curly braces, so that’s how you can tell where it’s coming from: from the caret package, I’m using createDataPartition. It makes things a lot easier when you can use packages, but it does abstract you away a little bit from what’s going on, so you still have to be careful and understand what’s happening. We call createDataPartition with p, the partition, equal to 0.8 or whatever number you want, and list equals FALSE. I don’t want it to return a list; I want it to return a vector.

If I do Command+Enter on that, you can see on the right-hand side that it creates training.samples, 1 through 28, and it’s got all the numbers. If I click on it, you can see all the numbers here. Now, because there are only 32 observations, that leaves us only four observations in our test set, so it’s a terrible data set for an example, but this will give you an idea of how to do it. So we split it randomly. Now again, if I go back to training.samples, you’ll see it skips some. Let’s see: 31, 30, 29, 28, and there’s no 27 here. So it randomly left that one out, but it’ll randomly do that the same way every time, because I set my seed to 123. If I set the seed to a different number, you’ll get a different random set. You’ll really notice it with larger data sets, but let’s continue on.

Now let’s build the model. We’ll just say the model is going to be a linear model, lm. We’re going to look at miles per gallon, as we always do, then tilde, dot. That’s just notation that says: based on all the features, what do you expect miles per gallon to be? I’m going to bring in the data, data equals train.data. Oh, I didn’t actually set that up. Let’s pause this.
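As a minimal sketch of the seeding and partition step just described, using only `set.seed()` and caret’s `createDataPartition()`: note that with `list = FALSE` caret returns a one-column integer matrix of row numbers, which behaves like a vector when used to subset rows. The name `again` is introduced here only to illustrate reproducibility.

```r
library(tidyverse)  # for the %>% pipe
library(caret)      # for createDataPartition()

set.seed(123)  # fixes the random draw so the split is reproducible

# Draw about 80% of the row numbers, sampled within quantile groups of mpg.
training.samples <- mtcars$mpg %>%
  createDataPartition(p = 0.8, list = FALSE)

length(training.samples)                          # rows assigned to training
setdiff(seq_len(nrow(mtcars)), training.samples)  # rows left over for testing

# Re-seeding with the same value reproduces exactly the same row numbers.
set.seed(123)
again <- mtcars$mpg %>%
  createDataPartition(p = 0.8, list = FALSE)
identical(training.samples, again)
```

Changing the argument to `set.seed()` gives a different set of row numbers, which is what drives the different error values later in the walkthrough.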
I created the random training samples, but I didn’t actually create the data sets, so let’s do that first. train.data is equal to my data subsetted by training.samples, which we just created, comma, all features. So that’s all the rows in training.samples, which, remember, didn’t have the 27th row, so that one won’t be included, and all the columns. train.data is now set. Then we’ll do test.data, and to subset the rest you simply use a negative training.samples: any row except the ones in training.samples, please pull those back and create test.data, and again I want all of the columns. It’s basically the complement of train.data.

Now we can go build the model on the training data: data equals train.data. That’s our linear model. Command+Enter, and the model is now set.

All right, let’s make predictions and test this bad boy out. We built the model, and now we’re going to predict using it. I should be using that spin notation, but I haven’t memorized it well enough to do it on the fly. Remember, spin, from a previous tutorial, is where you can create R Markdown documents from R scripts. We like to practice what we preach, right?

So predictions is the model piped into the predict function, and I bring in my test data. Remember the predict function, which we’ve used before: you bring in the model, and you also bring in your test data. Now, this is dplyr; we’re using the tidyverse, the tidy way to do things. When I pipe something in using the pipe operator, it’s just taking the place of writing predict, then the model, then my test data. All the pipe does is say: hey, I’m going to take this model and pipe it into the predict function. The predict function takes a model as its first argument, and I don’t include it inside the call because it’s supplied automatically by the pipe operator. So these two lines, 25 and 26, are equivalent, except that I didn’t assign the second one to predictions. It’s just a slightly different notation.

In fact, there’s an error: test.data not found. Did I mistype samples? See, without you guys I would be lost. Why did that not work? t-r-a-i-n-i-n-g... Wow, okay, it’s morning, coffee time. There, predictions. Now it worked. I have a set of predictions. If I click on it, well, you can’t even click on it because it’s just numbers, but I can type predictions down here in the console and it’ll give me the numbers. It’s not that clean, though. And as I said, there are only four, because it’s 28 training rows and four test rows. So the four are right here, and this is what it predicts: the Cadillac Fleetwood would be 13, the Dodge Challenger 17. So how well did we do? I don’t know yet, so let’s continue on.

Let’s do a comparison; this is what I like to do. I’ll create a data frame called compare. In the data frame, actual is equal to the test data’s miles per gallon. I’m only pulling out miles per gallon; I don’t want to compare features, because the features are going to be the same, and only miles per gallon differs. Then, comma, predicted equals the predictions I just created, right here on line 25. I’ll do Command+Enter on that, and now we can look at the compare data frame and say: okay, this is the actual and this is the predicted, and you can see how far off they are. The actual for the Cadillac was 10.4, and I predicted 13. The actual for the Challenger was 15.5, but I predicted 17. So I’m off on all of them. But look at this: the Porsche 914-2 is actually 26 and the model predicted 26.51; the Volvo is 21.4 versus 20. You can see some of the differences, and I just wanted to show you that.

Now, one way to measure this: you have to go back to statistics and understand what root mean squared error is. In a nutshell, it’s basically a kind of average difference between the actual and the predicted across all of the data, the square root of the mean of the squared differences, but go back and look up RMSE. In fact, we’re going to use an RMSE function here. We’ll call this section the error collection. I only have one so far: RMSE, which is from, I believe, caret. Yeah, caret. We give it predictions and the test data’s miles per gallon. What this does is take the root mean squared error of the predicted values compared to the test values. It’s similar to what we did in the compare data frame, except that RMSE is a function that returns the actual RMSE without us doing the calculations by hand. So our error is 2.56. Just remember that. In fact, let’s write that down: error one equals 2.56.

So what I want to show you, and this is really important, is what happens when I rerun the whole thing. First I’m going to clear the environment with this little broom icon. Delete everything; I don’t want anything left. Then I’m going to do Ctrl+A and Command+Enter or Ctrl+Enter, and you’ll see that my error is 4.808.
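Put together, the steps from subsetting through the RMSE could be sketched as follows. The names `my.data`, `predictions.direct`, `error.caret`, and `error.manual` are assumptions introduced for illustration, and the exact error you see will depend on your R version and seed, so no output values are shown.

```r
library(tidyverse)  # for the %>% pipe
library(caret)      # for createDataPartition() and RMSE()

my.data <- mtcars  # assumed spelling of "my data"

set.seed(123)
training.samples <- my.data$mpg %>%
  createDataPartition(p = 0.8, list = FALSE)

# Rows listed in training.samples go to training; the negative index keeps the rest.
train.data <- my.data[training.samples, ]
test.data  <- my.data[-training.samples, ]

# Linear model of mpg on every other column.
model <- lm(mpg ~ ., data = train.data)

# The piped and direct calls are equivalent: the pipe supplies the first argument.
predictions        <- model %>% predict(test.data)
predictions.direct <- predict(model, test.data)

# Side-by-side view of actual and predicted mpg for the held-out cars.
compare <- data.frame(actual = test.data$mpg, predicted = predictions)
compare

# RMSE from caret, and the same quantity computed by hand.
error.caret  <- RMSE(predictions, test.data$mpg)
error.manual <- sqrt(mean((predictions - test.data$mpg)^2))
```

Computing the error by hand next to `RMSE()` is only there to show what the function does: square the differences, average them, and take the square root.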
I played around with it a little bit, so let’s do this one more time: Ctrl+A, Ctrl+Enter, and it should be the same, 4.808. No matter how many times you run it, it should be 4.808. I made mistakes earlier and went back and modified things, and I think that’s why I got something different for the first error. So let’s record that correctly down here: 4.808.

What I want to show you is what happens when I change the seed. Remember, 4.808. Now if I run it again with a different seed, say 100, Ctrl+A, Ctrl+Enter, you’ll see that I have 3.803. You can see it right over here: 3.803. If I run it again, 3.803; it doesn’t change. Now if I change the seed to 10 or whatever number, Ctrl+A, Ctrl+Enter: 5.08.

So what’s going on here? Every time I change the seed, it’s a different set of random numbers from 1 through 32, because we’re splitting that data set of 32 rows. Because it’s pulling in different training data, and the models are built on that data, you’re going to get different error rates. Now, you’ll want to find the lowest error rate. I hope you understand what’s going on there: every time, I’m getting a different set of training data, and that’s the value of cross-validation and pulling in random data. If you just took the first 28 rows and left the last four, you could run into trouble, because maybe the data is sorted somehow, and that’s not going to give you good results either. So that’s the idea behind it.

What I want you to do for your homework is create some sort of loop that tries, I don’t know, maybe 100 different set.seed numbers and compares the RMSEs. Track the RMSEs in the loop somehow, then find the smallest RMSE, and that’s the model you probably want to use based on the test data. Again, though, this is a very, very tiny data set, so your values are not going to make much sense, honestly. If you can find a data set that’s got maybe 10,000 or 20,000 rows, that’s the one you want to play with.
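One possible shape for the homework loop is sketched below. The helper `rmse_for_seed()` and the choice of seeds 1 through 100 are hypothetical, not something shown in the video; the split, model, and error calculation follow the steps from the walkthrough.

```r
library(caret)  # for createDataPartition() and RMSE()

# Hypothetical helper: split with the given seed, fit the model,
# and return the RMSE on the held-out rows.
rmse_for_seed <- function(seed, data = mtcars, p = 0.8) {
  set.seed(seed)
  idx   <- createDataPartition(data$mpg, p = p, list = FALSE)
  train <- data[idx, ]
  test  <- data[-idx, ]
  model <- lm(mpg ~ ., data = train)
  RMSE(predict(model, test), test$mpg)
}

# Track one RMSE per seed.
seeds  <- 1:100
errors <- data.frame(seed = seeds, rmse = NA_real_)
for (i in seq_along(seeds)) {
  errors$rmse[i] <- rmse_for_seed(seeds[i])
}

# The seed with the smallest test RMSE, and the spread across all seeds.
best <- errors[which.min(errors$rmse), ]
best
summary(errors$rmse)
```

Looking at `summary(errors$rmse)` alongside the minimum gives a sense of how much the error moves from split to split on a data set this small.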