Transcript:
So to start with this chapter, let's first look at what cross-validation really is. In cross-validation, let's say you have a big training data set, call it X, and some kind of target associated with each row, Y. What you do is you choose certain parts of the data set, so this data set is transformed into two different data sets: a bigger one, x1 and y1, and a smaller one, x2 and y2. The bigger set is known as the training data set; the smaller one is known as the validation set, or "valid", whatever you want to call it. Sometimes people also create a third set called the test data set, but in most cases it's not required: if you build a model in a proper way, you only need the training data set and the validation data set.

So now what you do is you build your model, any kind of machine learning model you're building: you train it on the training set and you evaluate it on the validation set. You can check different metrics, like RMSE, log loss, etc., and see how it's performing on the validation set; if you see the loss is no longer decreasing, you stop training, and it's the same with other machine learning models like random forest or XGBoost, whatever you are using: you have to stop the hyperparameter tuning at a certain point. This is basically to avoid overfitting, and overfitting is when your model learns only the training data set and does not generalize properly to the validation set or a new test set. You want your model to generalize more and more. So why is it so important to do proper cross-validation? The simplest answer is generalization: if you do your cross-validation in a proper way, your model will generalize to unseen data and hence it's going to have good performance.

So what are the different types of cross-validation? Before we go into that, let's look at one more example. Let's say it's a binary classification problem and you have labels something like this: in this data, 30% of the values are ones and 70% of the values are zeros. Now, if you split this data randomly, it might be possible that you get a mix of zeros and ones in the training set but all zeros in the validation set, so basically you are training on all the positive samples but validating only on the negative ones. A model validated like this cannot be trusted: it's trained on mostly everything, but in validation you have only the negative samples, so you're not actually verifying that your model performs well on a collection of negative and positive samples, only on the negative ones. To counter that problem, you have to design a cross-validation system that takes care of the fraction of labels in both the training and validation sets, so that the distribution of positive and negative samples in training and validation is similar.
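To make the stratification idea concrete, here is a minimal sketch (my addition, not something shown in the video) using scikit-learn's train_test_split with the stratify argument; the data and names are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced binary targets: 70% zeros, 30% ones
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

# A purely random split can distort the class ratio;
# stratify=y preserves the 70/30 mix in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_valid.mean())  # both ~0.3
```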
So there are many, many different kinds of cross-validation, and the simplest or most basic kind is k-fold cross-validation. Another type of cross-validation is stratified k-fold. Going forward, you also have cases where you have a multi-label classification problem, and then you have regression problems; for regression cross-validation there are many different ways you can do it, and I'm just going to explain how I do it. And then you have holdout-based validation. So we're going to look at these different types of cross-validation, but from an application point of view: we're just going to pick some problems, see what kind of cross-validation we can use, and implement them in such a way that we can reuse them any time we want for different problems. So let's do that; I think it's better if we just start with coding and then look at how we can implement different types of cross-validation systems. I'm not going to code anything from scratch: there are many cool libraries available, and scikit-learn is one of them, I love it.

In the last episode we looked at this data from the Categorical Feature Encoding Challenge, the first one. There is a second version too, and it's good, you should take a look at it if you haven't yet. You can see the target column: you only have zeros and ones, so that's a simple binary classification problem; well, not so simple, but it is a binary classification problem, and you can see the count of zeros is around 200,000 and you have around 91,000 ones. So it's not an equal ratio. What you can do is create some kind of stratified k-fold cross-validation, and what stratified k-fold does is split into training and validation sets while keeping the ratio of positive to negative samples similar, or the same, in each fold. So if you have thirty percent positive samples in the training set, you are also going to have thirty percent positive samples in the validation set, and that's quite good for us, because our validation set should be as similar to the test set as possible.

So what we are going to do today is build a very general cross-validation framework that you can stick into any kind of problem, probably any kind of problem. I'm just going to create a new file called cross_validation.py, and I'll create a new class called CrossValidation. Let's write the __init__ function for it. This class has to take some input, and your input can be a data frame or several other things, so let's say your input is always a data frame; if not, convert it to a data frame (I'll tell you how). So here you need self, and you need other variables: let's call the first one df, the data frame; you also want to know what the target columns are, and notice I've written target columns, not just target column. You can include several other parameters, but let's start with just these first. So I'm going to say my data frame is this df, and my target columns is target_cols, so you can have multiple target columns, and then I'm going to create another variable, num_targets, which is just the length of target_cols. Let's keep it like that for now. And then we define the different types of splits.
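Collecting the parameters that get added over the course of the episode, the initializer ends up looking roughly like this; treat it as a sketch (shuffle only becomes a required argument near the end, and multilabel_delimiter is added last):

```python
import pandas as pd
from sklearn import model_selection


class CrossValidation:
    def __init__(
        self,
        df,
        target_cols,
        shuffle,
        problem_type="binary_classification",
        multilabel_delimiter=",",
        num_folds=5,
        random_state=42,
    ):
        self.dataframe = df
        self.target_cols = target_cols
        self.num_targets = len(target_cols)
        self.problem_type = problem_type
        self.multilabel_delimiter = multilabel_delimiter
        self.num_folds = num_folds
        self.shuffle = shuffle
        self.random_state = random_state

        if self.shuffle is True:
            # Shuffle the whole frame once up front, so the k-fold
            # splitters below can run with shuffle=False
            self.dataframe = self.dataframe.sample(frac=1).reset_index(drop=True)

        # Every row starts unassigned; split() fills this column in
        self.dataframe["kfold"] = -1
```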
So I'm just going to create a function called split, and that function is going to look at the data frame and at the target columns, see what kind of targets we have, and split the data accordingly. We are going to use some if/else statements here. So: if self.num_targets is equal to 1, if you have only one target, then check the number of unique values in that target column. The unique values will be self.dataframe, then target_cols[0], because there is only one, so we're looking at index 0, then .nunique(), which gives you the number of unique values (I think you probably need brackets here). So we have that, and now we can go to the next step. You cannot have just one unique value, right? So if unique_values is 1, you don't need to build a model at all, and I'm going to raise an exception: only one unique value found. Then we have elif unique_values == 2. When you have two unique values, what does that mean? It means it's a binary classification problem. Now, it can be a binary classification problem in which your positive and negative samples are balanced, or it can look like one where you don't have much of a balance, with few ones and many zeros. But anyway, in this case it's always a good idea to just say: I want stratified k-fold. It's a binary classification problem, so whatever I have, I want cross-validation that splits while preserving the ratio of positive to negative samples. Or you can also have plain k-fold. So what do you want to do? Should we do both? And if you do that, you would need one more branch for more than two unique values.

We also have a problem here: we're not handling many different cases. So let's look at what different kinds of problems we can have: we can have a binary classification problem, we can have multi-class classification, we can have multi-label classification, you can have single-column regression, you can also have multi-column regression, and you can also do a holdout-based problem. I cannot think of anything else; I hope I'm not missing anything. So I'll change this class a little bit: I'll add a problem_type parameter, and the default value is "binary_classification". Now we don't need to infer all this anymore, and I can just set self.problem_type = problem_type. We already know what kind of problem we're trying to solve anyway, so let's just put it as an input to make things simpler. You don't need the unique-value inference anymore, so I'm just going to remove it; if you want to keep it, you can keep it, and I am going to keep the check itself. Then, for binary classification, you could ask what kind of folds you want, but we will always take stratified k-fold: we just want the ratio to be similar every time, we don't want to go for plain k-fold cross-validation. So for this branch, let's look at what we did last time.
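Before filling in each case, here is a rough skeleton of where split() is heading; the exact problem-type strings are my guesses at the naming, so treat this as a reconstruction:

```python
    def split(self):
        # Dispatch on the declared problem type
        if self.problem_type in ("binary_classification", "multiclass_classification"):
            pass  # stratified k-fold, filled in below
        elif self.problem_type in ("single_col_regression", "multi_col_regression"):
            pass  # plain k-fold
        elif self.problem_type.startswith("holdout_"):
            pass  # e.g. "holdout_10" keeps 10% of the rows aside
        elif self.problem_type == "multilabel_classification":
            pass  # stratify on the label count per sample
        else:
            raise Exception("Problem type not understood!")
        return self.dataframe
```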
We had this kf, which is the k-fold object: it gets a StratifiedKFold with a shuffle setting and some kind of random state, and it also asks how many splits you want. So I'm just going to take that and say, okay, it's a binary classification problem; target_cols is actually a list of column names and you have only one, so target is self.target_cols[0], and I'm going to stick it here. Okay, and we check the problem type, that makes it much better. That looks good. We also want to know num_folds, how many folds we want, so I stick in five as the default value; we want to know shuffle (let's say we want to shuffle); and then we also have random_state, equal to 42. So we need these now: self.num_folds, self.shuffle, self.random_state. We've got all these things.

Now, if shuffle is True, then just shuffle the data frame; we already did this in the last episode, so I'm just going to take that part, stick it in, and replace some values: this should be self.dataframe. So I'm just shuffling the whole data frame, I don't care about anything else, and here you now have self.num_folds and self.random_state. So we've got everything in here; this class is looking much better now. What else did we have last time? We need one more thing: on self.dataframe I create a new column called kfold, and all values are -1 there. Then I say okay, for every fold in kf.split: if you enumerate that, you are going to get the training indices and the validation indices, and fold is just the index of the iteration. So this is a generator, and whenever it generates, it generates training indices and validation indices. Here you need a value of X, which is your data frame, and a value of y, which is self.dataframe[target].values, so we have the targets there. So we're done with this part, and now I'm just going to remove the old bit; this should also be self.dataframe here. So you can see, for every fold, at the validation indices you fill in the fold number in the kfold column. Once you have this, you can return self.dataframe, and now we can at least test this part on a binary classification problem and see if it works.

So what I'm going to do is just write: cv = CrossValidation(...); we need the parameters, the data frame and the target columns, but we don't need problem_type and everything else, we're just going to keep them as defaults. I'm going to say target_cols is ["target"], and I need a few more things that I'm importing: from sklearn import model_selection, and we also need pandas, so import pandas as pd. Okay, we have this model_selection thing. Then I'm just going to say the data frame is read from "../input/train.csv"; it's the same old categorical feature encoding problem data. So I've got this, and I'm just going to say df_split = cv.split(), which does all the splitting for me.
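Putting that together, the stratified branch of split() plus a small driver looks roughly like this; it's a sketch reconstructed from the description above, with shuffle=False inside StratifiedKFold because the frame was already shuffled in __init__ (the path is the one used in the episode):

```python
        # Inside split(): stratified k-fold for binary targets
        if self.problem_type == "binary_classification":
            if self.dataframe[self.target_cols[0]].nunique() == 1:
                raise Exception("Only one unique value found!")
            target = self.target_cols[0]
            kf = model_selection.StratifiedKFold(
                n_splits=self.num_folds, shuffle=False
            )
            for fold, (train_idx, val_idx) in enumerate(
                kf.split(X=self.dataframe, y=self.dataframe[target].values)
            ):
                # Mark the validation rows of this fold
                self.dataframe.loc[val_idx, "kfold"] = fold
```

And the driver at the bottom of the file:

```python
if __name__ == "__main__":
    df = pd.read_csv("../input/train.csv")
    cv = CrossValidation(df, target_cols=["target"], shuffle=True)
    df_split = cv.split()
    print(df_split.kfold.value_counts())
```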
So let's try to run this, and we get some errors. Okay, so I cd into the source directory and run python cross_validation.py. "'str' object does not support item assignment": so the problem is I did not read the data into a data frame, and my df was actually a string instead of a data frame. Pardon my French; pd.read_csv it is. And "unique_values is not defined": we don't strictly need unique_values, but if we have to define it, we can: unique_values is self.dataframe[target].nunique(), so I'm just going to take this thing, move it up here, and that's my unique values, and it's done. Next: "Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24." So we don't need to set a random state on the splitter; let's get rid of that warning and see how it looks now. Okay, seems like there's no warning anymore, so let's see the data. I need to print it. Okay, so we've got the ids and we've got the kfold column, but we need to check whether we have the proper values in the kfold column, so I'm just going to do df_split.kfold.value_counts(), and you see every fold has 60,000 validation samples, which is quite good.

So we just created a class that we can probably use for all kinds of binary classification problems, if you have the data in this data frame format. And that's actually very easy, because for any kind of problem, even if it's image-related or anything, you are looking at some kind of ids, so you can just create a data frame with ids and assign a target to each id. All problems can be converted to this data frame format, actually, and then you have the folds.

Now we'll go to a different kind of problem: multi-class classification, which is also very, very similar to this one, so let's check that out first. (Multi-label is going to be interesting, but that comes later.) So, multi-class classification: you have binary classification handled here; to add multi-class classification, what does it mean? You have multiple classes. I'm not going to do anything interesting here: I'm just going to keep the branch as it is and add "multiclass_classification" to the same condition, and we are done. Yeah, it's as simple as that. You have multiple classes, you want to do stratified k-fold; you have binary classes, you want to do stratified k-fold. Simple. So we are done with this one.
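Since the branch is shared, a quick hypothetical check (my addition; it assumes the CrossValidation class above with the widened condition) shows the class mix being preserved per fold:

```python
import numpy as np
import pandas as pd

# Hypothetical three-class frame with a 50/30/20 mix
df_mc = pd.DataFrame({
    "id": range(300),
    "target": np.random.choice([0, 1, 2], size=300, p=[0.5, 0.3, 0.2]),
})
cv = CrossValidation(
    df_mc, target_cols=["target"], shuffle=True,
    problem_type="multiclass_classification",
)
df_mc_split = cv.split()
# Each fold should keep roughly the same class proportions
print(df_mc_split.groupby("kfold")["target"].value_counts(normalize=True))
```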
Now we move to single-column regression, so the problem type is "single_col_regression"; let's see how we define it. You can also add one more exception here: if self.num_targets is not equal to 1, raise an exception about the invalid number of targets for this problem type. Then what you can do is just copy-paste the earlier branch, bring it here, and you have the same thing: target, which is self.target_cols[0], because you have only one target. Now, for regression there are two different ways of doing this. One is the simple way that I'm going to show today; the second method I'm just going to talk about and leave as an exercise, and since the code is going to be on GitHub, you can send in a pull request and we will merge the best one. The basic way is just doing plain k-fold: it's a regression problem, you don't need to worry about a lot of things. So what you have here is n_splits, which is the same as self.num_folds, and shuffle is always False here, so let's keep it that way, and then you have the same for fold, (train_idx, val_idx) loop; I'm just going to copy that in, and we are almost done. Hmm, "Problem type not understood": fix the problem type string.

So now we need to test it on a regression problem, and I don't remember any, so let's see if we can find a regression data set. Oh yeah, there was a very famous one: House Prices - Advanced Regression Techniques. So I'm just going to download the data for this. Let me see what we have here: we have the SalePrice column; if you look at the sample submission, you have the SalePrice column, so that's all we need. Let me download this data and save it in the appropriate location, the same folder, as train_reg.csv. Okay, so I have the file. Now I read it, you have all the columns, and the last column should be SalePrice; yeah, it is SalePrice. So now what I can do is point the script at train_reg.csv and change the problem type: it was binary classification, now it's "single_col_regression". Let's run this and see what happens. Obviously it gives me an error, because the target column has changed: it's now the sale price. What was the exact name, sale_price? No, SalePrice. Okay, so it's SalePrice, and we are doing a simple k-fold; we should have everything now. Awesome, done: 292 samples in each validation fold, and you have the kfold column here. So yes, we just made a class that can handle three different types of problems.

Now we go to something else: multi-column regression. Okay, first, I don't think you actually need y here; let me just take a look at the documentation for split quickly. Yeah, you don't need y, because it's KFold, so I've removed y, and sure enough, it still works fine. So now your problem type is multi-column regression, and it's also quite simple: just use a basic k-fold. But you cannot simply add it to this branch, and you know why: because we checked that num_targets is not equal to one. You can still add it, but then you need another check for multi-column regression.
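A sketch of the whole regression branch, including the multi-column check that gets added in a moment (again a reconstruction, continuing inside split()):

```python
        # Inside split(): plain k-fold for regression targets
        elif self.problem_type in ("single_col_regression", "multi_col_regression"):
            if self.num_targets != 1 and self.problem_type == "single_col_regression":
                raise Exception("Invalid number of targets for this problem type")
            if self.num_targets < 2 and self.problem_type == "multi_col_regression":
                raise Exception("Invalid number of targets for this problem type")
            kf = model_selection.KFold(n_splits=self.num_folds)
            # KFold.split only needs X; y is not required for a plain split
            for fold, (train_idx, val_idx) in enumerate(kf.split(X=self.dataframe)):
                self.dataframe.loc[val_idx, "kfold"] = fold
```

Tested on the House Prices data (1460 rows, hence 292 per fold), the driver becomes:

```python
df = pd.read_csv("../input/train_reg.csv")  # House Prices data, path assumed
cv = CrossValidation(
    df, target_cols=["SalePrice"], shuffle=True,
    problem_type="single_col_regression",
)
print(cv.split().kfold.value_counts())  # 292 rows in each of the 5 folds
```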
For multi-column regression the number of targets should be more than one, so I can just check num_targets: if it's less than two and my problem type is multi_col_regression, raise the invalid-number-of-targets exception. And that will work. I don't have a data set to test this one on, but it will work, I'm pretty sure about that.

The next thing we have: for regression, another interesting approach is to split in such a way that you keep the distribution of the target values similar in each and every fold, and that's an exercise. If you can do it for single-column or multi-column regression, just send a pull request on the mlframework repo and we will merge it; and if I don't get any pull requests in the next ten days, then I'm going to do it myself.

So next you have multi-label classification. Multi-label classification is when you have multiple labels associated with the samples. Imagine an image classification problem: you have an image and you have to say what kinds of objects are in the image; if there is a car and there is a bike, that becomes your multi-label classification problem. In a multi-label classification problem you have to format the data in such a way that you have one target column, but you need to split that target column on commas or some other kind of delimiter, so let's say in one column the targets look like "car,bike,cycle" and things like that. I'll show you how.

And then you have the holdout one left, which is also a very interesting one. So the next one we can tackle is the holdout, and holdout is very important when you have time-series data. In that case you don't want to do k-fold cross-validation, because then you would be using something from a future timestamp, and that's not a good idea: it's going to overfit your model, you're going to get really nice validation scores, but they won't mean anything in the real world. Another important thing about holdout is when you have a lot of samples, you know, millions of rows: can you really afford to do 5-fold or 10-fold cross-validation? No. In those cases you just say: out of ten million samples, I'm selecting 100,000 samples and I'm going to do a holdout-based validation, only that, nothing else. So what we can do is say something like: check self.problem_type, but now we do it differently, because you can have a holdout of 10% or 20% or 5% or whatever percentage, right? So we check if it starts with "holdout_"; it can be holdout_5, holdout_10, and so on and so forth. In this case you don't care about how many targets you have, and you don't even care whether it's a regression or classification problem. So what we do is work out the holdout percentage: it will be self.problem_type split on the underscore, taking the element after it, and I'm just going to convert that to an int. Then we have num_holdout_samples, which will be the length of self.dataframe, the total number of samples you have, times the holdout percentage over a hundred, and that should give it; I wrap it in int(), and you could also do floor, but then I'd have to import NumPy. So I've got this part.
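Here is where that branch ends up, including the fold assignment described next; a sketch that assumes the index is a clean RangeIndex (which it is after the reset_index in __init__ when shuffling):

```python
        # Inside split(): holdout_<pct>, e.g. "holdout_10" holds out 10% of rows
        elif self.problem_type.startswith("holdout_"):
            holdout_percentage = int(self.problem_type.split("_")[1])
            num_holdout_samples = int(
                len(self.dataframe) * holdout_percentage / 100
            )
            num_training = len(self.dataframe) - num_holdout_samples
            # .loc label slicing is inclusive, hence the -1; fold 0 is the
            # larger training part, fold 1 is the smaller holdout tail
            self.dataframe.loc[: num_training - 1, "kfold"] = 0
            self.dataframe.loc[num_training:, "kfold"] = 1
```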
And now we assign the values: in the kfold column of the data frame, the last num_holdout_samples rows get kfold = 1, and everything from the start up to there gets kfold = 0. The shuffling has already happened (well, whether to shuffle comes from the user, but they have presumably shuffled the data), and then you assign the folds: fold 0 should have more samples, and this is your holdout, so it will have fewer samples. So let's see if that works. We can go back to our binary classification problem, let's keep things easy, and the problem type will be "holdout_10"; let's try to run it and see what happens. "Can only index by location with integer slices": okay, so I need to fix the indexing. Hmm, we have only zeros, 300,000 zeros, and that's not good, so it seems like this is incorrect: the number of holdout samples would be 5 if you had five percent and a hundred samples, so it should be the other element of the split. Yeah, that should work. What do we get now? Still only zeros, so where is the problem? Okay, it should be divided by a hundred. Yeah, here we go: everything that is one, those thirty thousand samples, is the holdout now, and the rest are zero. And here you can see the kfold column; you only see the head of it, so it's all zeros there. You can also do "holdout_20", and then you get twenty percent of the samples as holdout, which is correct.

So this is important when you have time-series data. When you have time-series data, what you want to do is set shuffle to False, so I'm not leaving it as a default argument anymore: whether to shuffle should be given by the user, it's very important. Then we can move to the next part. So this is very simple, it cannot be simpler than this, and it works most of the time; let's say most of the time it works.
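As a usage sketch (same assumed path as before; for time-series data you would pass shuffle=False and keep the rows in time order):

```python
df = pd.read_csv("../input/train.csv")
cv = CrossValidation(
    df, target_cols=["target"], shuffle=True, problem_type="holdout_20"
)
df_split = cv.split()
# kfold == 0 is the 80% training part, kfold == 1 the 20% holdout
print(df_split.kfold.value_counts())
```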
Then we are left with multi-label classification. For multi-label classification, what does your training data frame look like? Let's say it's something like this: you have the id column and a target column, where id can be 1, 2, 3, whatever, and you have to build your data frame in such a way that the target column holds the categories separated by some kind of delimiter. We'll just stick to a comma delimiter: one row has "6,55,6", another has "7,58,9", another has "67,8" (sorry, I've written these very randomly, you don't need to do that), with no spaces, just separated by a comma. You probably don't have the same number of labels everywhere: here I have one class, there two classes, you can have three classes, and so on. To build that, we'll go back and say: elif self.problem_type == "multilabel_classification". We are only dealing with one column here, so if you have multiple columns for multi-labels, you can just combine them into one column using a comma delimiter. You also need to take care, because you can have class names instead of numbers, but the way we handle it means you don't need to worry about that. So I'm just going to say: if num_targets is not one, then it's an invalid problem for me, and I raise the invalid-number-of-targets exception.

Now what I'm going to do is create the folds based on the count of the number of classes per sample. So I have targets, which takes self.target_cols[0], because we have only one target column, and we say self.dataframe, then the target column, then .apply with a lambda: convert x to a string just in case, split on the comma (no space, only a comma), and take the length of that, so a sample can have one, two, three classes, four, five, whatever. Then everything else remains the same as before: I could take stratified k-fold or plain k-fold, and I'm just going to take stratified k-fold on these counts, with shuffle always False by default, and here we go, that should give us multi-label classification. Let me check if this is correct: you have kfold, you have the data frame, you have... oh, y is not correct anymore, so you need to change it to targets, and there we go, we have built the cross-validation framework, and we can also test multi-label classification.

I don't remember which competition had multi-label... I'm going to find one. Okay, probably this one; if I look at the labels, yeah, it has multi-label. I cannot find it at the moment, but I'm pretty sure this one had multi-label... okay, yeah, it has multi-label, trust me on that. So what do we have here? Ah, okay, it was here, sorry: we have the attribute_ids column, and the labels are separated by a space. So I think we can just add a parameter, multilabel_delimiter, which is a comma by default, but you can change it to whatever you want, and then in the lambda we say self.multilabel_delimiter. I think I also need to add this to __init__, so I'll go back up and add it there. Okay, let's download the data set and then we can test it. No, it was not this one... is this data set a huge file? No, it's not a huge file. Let me just move it to an appropriate location; now I have it, so we have train_multilabel. You have id and attribute_ids, and they are separated by a space, so let's see if our trick works. I'm just going to read train_multilabel.csv, the target column is attribute_ids, my problem type is "multilabel_classification", and we also pass multilabel_delimiter, which is a space now. I really hope this works. Okay, no shuffle... oh yeah, shuffle is no longer an optional argument; shuffle=True for this problem, that can be True. And there we have it, it works, and that's pretty amazing: we have all the different kinds of cross-validation that we need in one single class, and we can just use it for any kind of problem we want. You can use holdout when you have a time-series problem or when you have a lot of data, and when you have time series, don't shuffle; that's what you have to remember.
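The multi-label branch and its driver, sketched out (the dataset path and column name follow the description above):

```python
        # Inside split(): stratify on how many labels each sample has,
        # not on the labels themselves
        elif self.problem_type == "multilabel_classification":
            if self.num_targets != 1:
                raise Exception("Invalid number of targets for this problem type")
            targets = self.dataframe[self.target_cols[0]].apply(
                lambda x: len(str(x).split(self.multilabel_delimiter))
            )
            kf = model_selection.StratifiedKFold(n_splits=self.num_folds)
            for fold, (train_idx, val_idx) in enumerate(
                kf.split(X=self.dataframe, y=targets)
            ):
                self.dataframe.loc[val_idx, "kfold"] = fold
```

And the corresponding driver:

```python
df = pd.read_csv("../input/train_multilabel.csv")
cv = CrossValidation(
    df, target_cols=["attribute_ids"], shuffle=True,
    problem_type="multilabel_classification", multilabel_delimiter=" ",
)
print(cv.split().kfold.value_counts())
```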
For multi-label classification, what I do is divide the samples by the number of targets each one has, but there are many other ways, and in the same competition, I think, Konstantin posted a kernel where he also begins by creating the folds. Let me see if I can find the code... yeah, okay, here it is. It's some great code by Konstantin, so take a look at it: he is also making folds, and he is using a similar technique; it's not the same, but you can use this technique too. Take a look at it, and if you don't understand it, ask me or ask him. And you can use different kinds of splits: you can say, okay, the problem type is this one, but you have different ways to split it, so you can add that as a new argument.

So what I'm going to do now is wrap this up. Those are your different types of cross-validation; we have used minimal libraries, and scikit-learn is the strongest library here, and you can use this class in any kind of problem. This code will be online. You also have an exercise where you have to do the regression part in a more intelligent manner, and it does help: if you split keeping not just the ratio but the distribution of the target similar, your problem is much better off.

So now you have your cross-validation. And what's next? The next thing we've already done in the last episode: we had a training file, and inside the training file I always use this fold mapping. You can create a fold mapping like this, and it's very easy to create: it says, if your fold is zero, you choose from the data frame, in the kfold column, wherever it is 0, and that becomes your validation, and everything else becomes training. So you train on these and validate on that, train on those and validate on this, and so on, and when you do that, you can also create predictions for each of these validation sets, and after that you just take an average of all the predictions, so you have, like, five different models. Or you can use the folds to do early stopping if you're training a neural network, or even with XGBoost or LightGBM. We'll cover that in future episodes, and I think that's all for now. Let me know in the comments if I missed something; if I missed something, I'm going to create another episode which is a continuation of this one, and if not, I'm going to move on to the next one, on categorical data, and that's just really super interesting: so many different ways to handle categorical data, and sometimes it's just way too easy, you know. OK then, see you next time. Bye.
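For reference, the fold mapping described above looks something like this; a sketch with illustrative names, assuming df_split carries the kfold column produced by the class:

```python
# Each key is the validation fold; its value lists the training folds
FOLD_MAPPING = {
    0: [1, 2, 3, 4],
    1: [0, 2, 3, 4],
    2: [0, 1, 3, 4],
    3: [0, 1, 2, 4],
    4: [0, 1, 2, 3],
}

fold = 0  # current validation fold
train_df = df_split[df_split.kfold.isin(FOLD_MAPPING[fold])].reset_index(drop=True)
valid_df = df_split[df_split.kfold == fold].reset_index(drop=True)
```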