Transcript:
Hi everyone, welcome to the session where we will learn about K-Fold cross validation using the KFold function from sklearn.model_selection. So how does it work? The K-Fold cross validation technique splits the data into train and test data sets across different folds: the data set is split into K consecutive folds, without shuffling by default (there is a shuffle parameter), and each fold is then used once as validation while the K minus 1 remaining folds form the training set. So what does that mean? Suppose I have 100 records, with indices from 0 to 99. In the very first iteration, the first 20 records, from index 0 to 19, will be my test data, whereas the records from index 20 to 99 will be my train data set. I have shown this just for understanding, because the data is divided into a number of folds; we have given five folds, so there will be five iterations. In the second iteration, the records from index 20 to 39 will be my test data set, whereas the first, third, fourth and fifth folds will be my train data set. Similarly, in the third iteration, the third fold, containing the records from index 40 to 59, will be my test data set, and the first, second, fourth and fifth folds will be my train data set. And in the last iteration, the first, second, third and fourth folds, that is the records from index 0 to 79, will be my train data set, and the last fold, the records from index 80 to 99, will be my test data set. So this is what I tried to explain here. There are three parameters used in K-Fold cross validation: n_splits, shuffle and random_state. n_splits is the number of folds, which should be at least two, and from scikit-learn version 0.22 the default number of folds changed from 3 to 5. The shuffle parameter defaults to False.
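The fold layout described above can be sketched with a few lines of code; this is a minimal illustration assuming scikit-learn is installed, using a toy array of 100 records:

```python
import numpy as np
from sklearn.model_selection import KFold

# A toy dataset of 100 records (indices 0..99), as in the example above.
X = np.arange(100).reshape(100, 1)

# n_splits=5 gives five folds of 20 records each; shuffle=False (the
# default) keeps the folds consecutive: 0-19, 20-39, ..., 80-99.
kf = KFold(n_splits=5, shuffle=False)

# Collect the test-fold indices of each of the five iterations.
folds = [test_idx for _, test_idx in kf.split(X)]
print(folds[0])   # first test fold: indices 0 through 19
print(folds[-1])  # last test fold: indices 80 through 99
```

Each iteration's training set is simply everything outside that iteration's test fold.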
It is False! If you want randomness, you can set it to True and run K-Fold cross validation, and random_state controls that randomness. For more details on cross validation, please visit the scikit-learn page; I have given the link here. So now I will start my execution of K-Fold cross validation. These are the libraries we will use; I have already imported them. So what am I doing here? I am importing numpy, KFold, RandomForestClassifier, KNeighborsClassifier, LogisticRegression and also accuracy_score. After this, I generate a feature matrix X with 100 rows and two columns, followed by creating my y. So what is my X? X is these 100 rows and two columns, and y should also have 100 rows, otherwise it will not work. Okay, so our X and y are generated. Now, why have I used np.hstack here? Just have a look; it will hardly take 30 to 40 seconds to understand. Suppose my a is np.array([1, 2, 3, 4]) and my b is np.array([5, 6, 7, 8]). I want to place array a and array b side by side, so I use np.hstack, and you can see: 1, 2, 3, 4, 5, 6, 7, 8; the two arrays are placed side by side. Why am I emphasizing hstack? Because when I show my visualization of K-Fold and stratified K-Fold, I will be using it, so I just wanted to show you. Now, let's get back to where we were. For n_splits we will give five, and then we will create our KFold cross validation object. After this, we will simply run a loop: for train_index, test_index in cv.split(X). And we'll run this, so you can see here.
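The hstack detour above can be reproduced directly; this is a short sketch of the same two-array example:

```python
import numpy as np

# The two small arrays from the walkthrough above.
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# np.hstack places the arrays side by side in one flat array.
stacked = np.hstack((a, b))
print(stacked)  # [1 2 3 4 5 6 7 8]
```

For 1-D inputs, np.hstack simply concatenates along the single axis, which is why the elements appear one after the other.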
Whatever I was trying to show in that visualization, you can see it in the output: my test data in the very first iteration is from index 0 to 19, and my train data is from 20 to 99. Similarly, in the second iteration my test data is from index 20 to 39, and the train data is from 0 to 19 and then from 40 to 99. In the third iteration my test data is from index 40 to 59, and my train data is from 0 to 39 followed by 60 to 99. Similarly, let's go and check the last iteration: my train data is from index 0 to 79, and my test data is from 80 to 99. So this is how K-Fold cross validation works. But remember, these are only the indices, not the actual data. So how do I get my actual data? For that you will print the data selected by those indices, so let's execute this. You can see in this way our train and test data are being divided: this is the index, and this is the data. This is how I get my train and test data. Now, instead of 100 records, let's take 20; in that case you can visualize it very easily. In the very first iteration, if you see here: 20 records divided by the number of folds, which is five, means every fold will have four items.
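Turning the index arrays into actual data is just fancy indexing; here is a minimal sketch using the easier-to-read 20-record example (20 records / 5 folds = 4 items per test fold):

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 records instead of 100 make the folds easy to eyeball.
X = np.arange(20)
kf = KFold(n_splits=5)

for train_idx, test_idx in kf.split(X):
    # cv.split yields index arrays; indexing X turns them into data.
    print("train data:", X[train_idx], "test data:", X[test_idx])
```

In the first iteration the test data is the first four records, 0 through 3, and the train data is everything from 4 to 19, exactly as in the walkthrough.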
Okay, so zero, one, two, three: four items, and those are my test data set, while the rest is my train data set. In the second iteration, again, you can see how the train data set and the test data set are divided. So this is how you can easily see the split. Now, in my previous videos (these two), I showed cross validation using sklearn.model_selection, but I only experimented with the indices; I did not show you how to extract the values of the data set, which I have now shown, and I also did not show you how to get the best score of a model, whether using KNN or random forest. So what will I do? We have the KNeighborsClassifier, RandomForestClassifier and LogisticRegression. I will simply import the Iris data set from sklearn.datasets, and then we will proceed with our execution. So I imported the library, and this is my Iris data set, which has 150 rows and four columns; the feature names, or X, are sepal length, sepal width, petal length and petal width. And I have three classes that we will be predicting here: virginica, versicolor and setosa will be my output y. So that is what we will be predicting; these are my actual values. Now we have our X and we have our y, so we will run a loop: for train_index, test_index in cv.split(X). After this, if I print train_index, this is my train index, and if I say X[train_index], this is my train data set. Again, the total number of records is 150, and I am dividing it into five folds, so 30 records will be my test data and the remaining 120 records will be my training data.
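The Iris setup described above can be sketched as follows; this assumes scikit-learn's bundled copy of the data set and just verifies the 120/30 split per fold:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

# Iris: 150 rows, 4 features (sepal length/width, petal length/width),
# and 3 target classes (setosa, versicolor, virginica).
iris = load_iris()
X, y = iris.data, iris.target

kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(X):
    # 150 records / 5 folds -> 120 train and 30 test records per fold.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    print(len(X_train), len(X_test))  # 120 30
```

Each pass of the loop gives a fresh train/test split of the same 150 rows, never reusing a row in test twice.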
So, as we do in cross validation, we have got our X_train, y_train, X_test and y_test. After this, we will create our k-nearest neighbors model. So what have I done? I have created the objects of my KNeighborsClassifier, RandomForestClassifier and LogisticRegression. Then, when I run the loop, at that time itself I fit all three models, so knn.fit and so on. After fitting all three models, I do the prediction. So, to summarize: while running this loop I am getting my X_train and X_test, then I am fitting my KNN, logistic regression and random forest, followed by predicting with all three models, and then I am also printing the accuracy I get in each iteration. So let's execute this and see what happens. Okay, so what have we done? We have created our logistic regression, KNN and random forest classifiers; after that, we have fit and predicted, and we have run the loop over the cross validation splits. You can see what happens here: in the first iteration, the accuracy for all three models is 1; in the second iteration the accuracy is also 1; but from the third, fourth and fifth iterations you can see the accuracy changing. So what can we do next? We can collect the accuracies in a list and then get the mean accuracy for each classifier. So we'll create an empty list for KNN, similarly a random forest classifier score list, followed by logistic regression. Then, in every iteration, we will append the score, passing the accuracy for that fold. So now, when we run this code, what will happen?
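The fit-and-score loop described above can be sketched like this; it is a minimal version assuming default hyperparameters (the random_state=0 on the forest is my addition, for repeatability, and max_iter=1000 just lets logistic regression converge on Iris):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5)

# One object per model, created once outside the loop.
knn = KNeighborsClassifier()
rfc = RandomForestClassifier(random_state=0)
lr = LogisticRegression(max_iter=1000)

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Fit and score each of the three models on this fold.
    for name, model in [("KNN", knn), ("RFC", rfc), ("LR", lr)]:
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"fold {fold} {name} accuracy: {acc:.3f}")
```

Because the unshuffled Iris folds are sorted by class, the early folds tend to score very high while the later ones vary, which matches what the session describes.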
Our models will be tested on five different folds of cross validation, and we will get an accuracy in each fold. We will append it to the list we created separately for each model, and after that we will take the mean of each score list. So let's start; it's the same loop as before. Now, what is in the KNN score list? You can see there are five accuracies: 1, 1, that is 100 percent, and after that it starts to decrease. So we take np.mean, and you can see the average accuracy for the KNN model is 0.906. Similarly, for the random forest classifier, what is my score? Again we have 1, 1, 0.86, 0.93 and so on; we take np.mean and get the RFC mean, which is also 0.906. For logistic regression, the mean score is 0.926. So let's print this. You can see here: the mean score for KNN is 0.906, for random forest it's 0.906, but for logistic regression it's 0.926, and I have also printed the step-wise accuracy of each fold. So this is how we do K-Fold cross validation with the KFold function from sklearn.model_selection. One more thing: what is the disadvantage of K-Fold? An imbalanced data set is the biggest one. If you have two classes and the folds are not stratified properly, we have a problem, so in order to solve it we will use stratified K-Fold, which resolves that class imbalance issue of K-Fold cross validation. That will be coming up in my later video; in the next video, tomorrow, I will show you how to visualize K-Fold cross validation. So thanks a lot for watching this session.
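The score-collection step can be sketched end to end as below; the list names and the random_state/max_iter settings are my own choices for the sketch, not something fixed by the session:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# One list of per-fold accuracies for each classifier.
knn_scores, rfc_scores, lr_scores = [], [], []

for train_idx, test_idx in KFold(n_splits=5).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    for scores, model in [
        (knn_scores, KNeighborsClassifier()),
        (rfc_scores, RandomForestClassifier(random_state=0)),
        (lr_scores, LogisticRegression(max_iter=1000)),
    ]:
        # Append this fold's accuracy to the model's own list.
        model.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))

# Mean accuracy across the five folds for each model.
print("KNN mean:", np.mean(knn_scores))
print("RFC mean:", np.mean(rfc_scores))
print("LR  mean:", np.mean(lr_scores))
```

Creating fresh model objects inside the loop guarantees no fold's fit leaks into the next; the means summarize how each classifier did across all five splits.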
If you feel this video was helpful, please press the like button and subscribe to this channel, and I will see you in my next video, which will be about the visualization of K-Fold. Thank you.