Transcript:

Okay, in this presentation we are going to look at scikit-learn, and the data set we're going to be basing this exercise on is called the Pima data set. It's a well-known data set: the Pima diabetes data set. Essentially, what we're trying to do with this data set is predict the incidence, the occurrence, of diabetes amongst a group of women of the Pima community in Arizona. Okay, so essentially it's based on their medical history, trying to identify the risk factors. It's a very interesting data set. Okay, so it's the Pima diabetes data set, and Pima being, I think, near Phoenix, Arizona.
Okay, so first off, let's get down to the actual Python here. I've got to put in the old reliables, NumPy and pandas. Okay. This is going to be a binary classification, and logistic regression is probably the most prominent binary classifier. Essentially, a binary classification will try to predict yes or no, ones or zeros, true or false, and logistic regression is probably the most well known of those, but we'll also try out a couple of other ones here: the random forest classifier and this LinearSVC. Okay, so they're from different submodules of scikit-learn: LogisticRegression is from linear_model, RandomForestClassifier is from ensemble, and LinearSVC is from svm. There's another one I was going to use, but I decided to drop it at the last minute. It's GaussianNB. I'll come back to that another time. Okay, so we import each of these classifiers, and essentially what we're going to do is just set them up here. Okay, so there we go. Oh, why don't I run this first, actually.
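A minimal sketch of the imports and classifier setup described here. The short names `lr`, `rfc`, and `svm` and the `C=1.0` setting are the ones mentioned later in the talk; leaving every other parameter at its default is an assumption, since the transcript does not show the actual cell.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Short names for the three classifiers.
# Assumption: defaults everywhere except C, which the talk says was set to 1.0.
lr = LogisticRegression()
rfc = RandomForestClassifier()
svm = LinearSVC(C=1.0)

print(type(lr).__name__, type(rfc).__name__, type(svm).__name__)
```

Creating the estimators up front like this only configures them; nothing is learned until `fit` is called on data.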
There we go. Now what I'm going to do is read in the Pima data set, okay, and it's hosted on GitHub there. Just to watch out, it's a raw URL, but you should be able to find this data set around the Internet. It's quite well known. Okay, and so I'll just run that: pima. Okay, let's just look at that. That's the first five cases. Now, actually, what I might do here is just adjust it to tail, just so you get a sense of how many cases there are. Okay, so this will give us a good sense of this data set. It's all numeric variables in this particular instance. There are 768 cases, and the last variable there, diab1, essentially relates to whether they have diabetes. It's a binary variable. And what we're trying to do here is build a predictive model based on all of the other variables: preg, gluc, dias, tric, two-hour serum, BMI, diabetes, and find out what they're about. preg is presumably something to do with pregnancy. gluc is maybe something about glucose, since it's about diabetes. Then age. Essentially, they're all related to medical history or medical details. Okay, so that's grand.
Okay, and let's just check for any missing values, or the .info on the data frame as well, just out of curiosity, to give you a bit of metadata about it. Okay, so there we have it. There's the shape as well. Essentially, it's just a quick inspection of what it's about. It's just good practice. Okay, now, this is the full data set, and what I'm going to try and do is build a predictive model. Essentially, what I'm going to do here is a train-test split. I have 768 cases, so from scikit-learn's model_selection I'm going to import this thing called train_test_split.
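The inspection steps mentioned here (head, tail, missing values, `.info`, shape) might look like the sketch below. The transcript does not give the GitHub URL, so this uses a synthetic stand-in frame with 768 rows instead of the real data; the column names are placeholders guessed from the talk, not the file's actual header.

```python
import numpy as np
import pandas as pd

# Stand-in for the real file. In the video the frame comes from
# pd.read_csv(...) on a raw GitHub URL that the transcript does not give.
rng = np.random.default_rng(0)
cols = ["preg", "gluc", "dias", "tric", "insulin", "bmi", "pedigree", "age"]  # hypothetical names
pima = pd.DataFrame(rng.normal(size=(768, len(cols))), columns=cols)
pima["diab1"] = rng.integers(0, 2, size=768)  # binary target, random here

print(pima.head())           # first five cases
print(pima.tail())           # last five cases; the index runs up to 767
print(pima.isnull().sum())   # count of missing values per column
pima.info()                  # dtypes and non-null counts
print(pima.shape)            # (768, 9)
```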
Essentially, what that does is split up the data set into two parts, the training data set and the testing data set. Okay, and according to this split here, the testing data set is going to be approximately 20% of the original data, meaning that the training data set is about 80%. Okay, so let's just run that. It just splits up the data. Essentially, what we do is build our model on the training data set and then see how good it is when we apply it to the testing data set. Okay, so this splits up the data for us automatically. Now, pima is there in the background, but rather than having one data set, we have our original data set and now also train and test in the environment. We can work with these now.
Okay, so there we have train. It's in existence. Now let's just print out the head, just to get a look at it. Let's print out three cases there. There we go. So it randomly selects 80% of the cases and puts them in random order as well, and you can tell by the shuffled indices here. Okay, 447 just got randomly selected to be the first case of the training data set. It doesn't really matter what order they come in, as long as they're assigned to the training data set or the testing data set. Okay, so 614 cases are assigned to the training data set, and you can see here that 154 have been allocated to the test data set. Okay, two nice, sizable data sets to work with.
So what we're going to do now is split up our data set into the features and the targets, as in the predictor variables and the response variables, or the independent variables and the dependent variables, and so on: essentially the X variables and the Y variables. There's actually a lot of different terminology for this.
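A sketch of the 80/20 split on a 768-row frame. The data is a synthetic stand-in with placeholder column names, and `random_state` is added only so the sketch is repeatable; the talk does not say whether a seed was used, so the shuffled indices will differ from the ones seen in the video.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Pima frame (hypothetical column names).
rng = np.random.default_rng(0)
cols = ["preg", "gluc", "dias", "tric", "insulin", "bmi", "pedigree", "age"]
pima = pd.DataFrame(rng.normal(size=(768, len(cols))), columns=cols)
pima["diab1"] = rng.integers(0, 2, size=768)

# Passing a whole DataFrame returns two DataFrames with shuffled rows.
# random_state is an assumption, used here for repeatability only.
train, test = train_test_split(pima, test_size=0.2, random_state=1)

print(train.head(3))             # note the shuffled index labels
print(train.shape, test.shape)   # (614, 9) (154, 9)
```

With 768 rows, a 20% test share rounds up to 154 cases, which leaves 614 for training, matching the counts in the talk.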
It depends on whether you come from computer science, statistics, machine learning, or other fields, but suffice to say what we have here are the X variables. Okay, now again, .head(5), just so as not to print out the whole lot, just to give a sense of what we've done here. If I was to comment out the .head, I'd get all 614 cases there. Okay, but the important thing I've done here is train.iloc. Essentially, what I've done here is subset the data frame. Now, there are loads of different ways of doing this, but I just picked this way for this particular video. In other videos I do things slightly differently; I deliberately mix and match. So, all rows: I have that colon there, and that just tells us all rows, and this is the first eight columns, as in column zero to column seven. Okay, so that's just a range of columns that we can work with, and those are going to be my features, my X variables.
And I've got to pick out my Y variable. Essentially, I'm just going to pick it out by name. In this case, it's diab1. So train features, train target, there we go. I have it picked out there now. Okay, so that's a series there. Let's just test to see what type it is. That's actually an important matter, because when you're specifying the target while fitting a model, it assumes that it is a Series or NumPy array, not a data frame. And it's quite easy to accidentally subset it as a data frame. Okay, well, actually.
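The positional subsetting described here can be sketched as follows. The names `train_feat` and `train_target` follow what is said later in the talk; the data is a synthetic stand-in, and the column names (including `diab1`) are placeholders reconstructed from the audio.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Pima frame (hypothetical column names).
rng = np.random.default_rng(0)
cols = ["preg", "gluc", "dias", "tric", "insulin", "bmi", "pedigree", "age"]
pima = pd.DataFrame(rng.normal(size=(768, len(cols))), columns=cols)
pima["diab1"] = rng.integers(0, 2, size=768)
train, test = train_test_split(pima, test_size=0.2, random_state=1)

# All rows (the bare colon), columns at positions 0 through 7.
# The end of an iloc slice is exclusive, so 0:8 gives eight columns.
train_feat = train.iloc[:, 0:8]

# The target picked out by name; single brackets give a Series.
train_target = train["diab1"]

print(train_feat.head(5))
print(train_feat.shape)      # (614, 8)
print(type(train_target))    # pandas Series
```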
I'll just show you what I mean there, because people see this quite a lot, particularly if you're new to Python, and you might see this type of thing knocking around. That's not going to change much, but you see what's happened here. That went from (614,) to (614, 1). Essentially, what happened there is it became a data frame because of the way I created it. Okay, there we go, see, that's a data frame. That could cause a little error. Well, it has caused me errors in the past, and that's why I'm very wary of it. Okay, now it's a series; grand again. Actually, this is good practice in general: the type command. Every so often, just check what you're doing. It could save you quite a bit of hassle, just every so often to see the type: is it what you expect?
Okay, so I'm going to move on from that. I think a couple of these next bullet points cover that material, but I've said everything I need to say, so I'm just going to move down a bit. Okay, so we're going to fit our logistic regression model, lr. Actually, sorry, where does this lr come from? We did something about that, didn't we? Okay, that's what we did: when we created those classifiers, lr, svm, rfc, essentially we just gave them shortened names and, where relevant, we gave them additional parameter values. So for the LinearSVC we specified C equals 1.0, just for the purpose of this video. This is more aimed at beginners, so I'm not really going to go into it.
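The Series-versus-DataFrame slip shown here usually comes down to single versus double brackets. The sketch below illustrates that on a synthetic stand-in frame; whether the video produced the data frame this exact way is an assumption, and the column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Pima frame (hypothetical column names).
rng = np.random.default_rng(0)
cols = ["preg", "gluc", "dias", "tric", "insulin", "bmi", "pedigree", "age"]
pima = pd.DataFrame(rng.normal(size=(768, len(cols))), columns=cols)
pima["diab1"] = rng.integers(0, 2, size=768)
train, test = train_test_split(pima, test_size=0.2, random_state=1)

as_series = train["diab1"]     # single brackets: one-dimensional Series
as_frame = train[["diab1"]]    # double brackets: one-column DataFrame

print(type(as_series).__name__, as_series.shape)   # Series (614,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (614, 1)

# Selecting the column by name turns the one-column frame back into a Series.
back_to_series = as_frame["diab1"]
print(type(back_to_series).__name__, back_to_series.shape)   # Series (614,)
```

Checking `type(...)` and `.shape` right after subsetting is a cheap way to catch this before fitting a model.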
You know, what that C actually means is fairly advanced, okay, and I'll be honest, I don't really know how to explain it in this sort of format. I have a fair idea, but I don't really know how to explain it in a way that I can convey through YouTube presentations like this. Okay, so lr is not just coming out of the blue; it was something we created previously.
Okay, so we're going to fit the model: lr.fit. We need our features, our X variables, and we need our Y variable. X and Y. So let's just fit the model there. It's done. Oops, I went back way too far there. Let's just go back again. Sorry, I got lost there very quickly. Now we're back on track. Okay, so we fit the model, lr.fit, with our X variables, train_feat, and train_target: features and target, X and Y. Okay, so that fits our model, and that gives us all the specifications for fitting this logistic regression model. As you can see here, there are actually quite a lot of them, and that requires a lot of reading and a lot of practice. To be honest with you, the nuances of all these take a while to get the hang of. If you're doing this professionally, it's worth spending time learning all about those.
But anyway, lr.score. We fit our model in cell 51 there, and essentially what we want to do here is test how good it is. lr.score is essentially a score of how good it is. So lr.score, with train_feat and train_target. Let's have a look at that. Grand. Now what I'm going to do is, actually, let's put it back up there. I am going to try it out with the test data and see how well it fares in this particular instance.
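A sketch of the fit-and-score step on the training data. It runs on a synthetic stand-in frame with placeholder column names, with the target loosely tied to two features so there is something to learn, so the printed score says nothing about the real Pima data. For a classifier, `score` returns mean accuracy, a value between 0 and 1.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Pima frame (hypothetical column names).
rng = np.random.default_rng(0)
cols = ["preg", "gluc", "dias", "tric", "insulin", "bmi", "pedigree", "age"]
pima = pd.DataFrame(rng.normal(size=(768, len(cols))), columns=cols)
# Target depends on two features plus noise; purely illustrative.
pima["diab1"] = (pima["gluc"] + pima["bmi"] + rng.normal(size=768) > 0).astype(int)

train, test = train_test_split(pima, test_size=0.2, random_state=1)
train_feat = train.iloc[:, 0:8]
train_target = train["diab1"]

lr = LogisticRegression()
lr.fit(train_feat, train_target)      # X first, then y

print(lr.get_params())                # the full list of model settings
train_score = lr.score(train_feat, train_target)
print(train_score)                    # mean accuracy on the training data
```

A score computed on the same rows the model was fitted on tends to be optimistic, which is why the talk turns to the test data next.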
I can't do that, actually, because I forgot to split it up. I'll tell you what, I'd have to come back to that later on. No, I'll do it now. I could just pause the video for a second, because it's a rehash of what I've done, so I'm not going to waste any more time. Do you know what? I'll even save it for the next video, okay.
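The step deferred here, splitting the test set into features and target and scoring on it, is not shown in this talk. A hedged sketch of what it might look like, mirroring the training-side subsetting, is below. The names `test_feat` and `test_target` are hypothetical, and the data is again a synthetic stand-in with placeholder column names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Pima frame (hypothetical column names).
rng = np.random.default_rng(0)
cols = ["preg", "gluc", "dias", "tric", "insulin", "bmi", "pedigree", "age"]
pima = pd.DataFrame(rng.normal(size=(768, len(cols))), columns=cols)
pima["diab1"] = (pima["gluc"] + pima["bmi"] + rng.normal(size=768) > 0).astype(int)

train, test = train_test_split(pima, test_size=0.2, random_state=1)
train_feat, train_target = train.iloc[:, 0:8], train["diab1"]

# Hypothetical names: the same subsetting applied to the held-out rows.
test_feat, test_target = test.iloc[:, 0:8], test["diab1"]

lr = LogisticRegression().fit(train_feat, train_target)
test_score = lr.score(test_feat, test_target)   # mean accuracy on unseen rows
print(test_feat.shape, test_target.shape)       # (154, 8) (154,)
print(test_score)
```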