Transcript:
[MUSIC] Hello everyone, and welcome to this video. In this video, I’m going to explain the concept of logistic regression classifiers, and we will directly apply it to a practical case study. In the case study, we will assume that you have been hired as a consultant to a startup that is running targeted marketing ads on Facebook. The company wants to analyze customer behavior by predicting which customers will click on the ad. Customer data, such as time spent on Facebook and estimated salary, will be used to develop our predictive model. Logistic regression is used to predict binary outputs with two possible values, either zero or one. Simply put, logistic model outputs could be one of two classes: pass or fail, win or lose, healthy or sick. So let’s get started with the case study.

All right, so let’s get started with the logistic regression intuition. First, linear regression is used to predict outputs that are continuous. For example, if I wanted to predict the salary of an employee, or tomorrow’s stock price, all these outputs are continuous, and we can use simple linear regression to do that job for us. However, logistic regression is used to predict binary outputs that have two possible values, either zero or one. So logistic model outputs could be, as I mentioned, one of two classes: win or lose, healthy or sick, and so on.

Let’s take a look at an example so I can illustrate the idea. Let’s assume that I have this table here, and I want to develop a relationship between the number of hours of studying and the exam result. So here I have the number of hours of studying that any student could do, and on the y-axis here I have pass or fail. Simply put, if students study for one hour, most likely they will fail. If they study for one and a half hours, they will fail. Two hours, they will fail. Three hours, they will pass the exam. At 3.25 hours they actually get a fail, and then four hours is a pass, five a pass, and six a pass as well.

So suppose we take that data, which again we collected in the field from many students, and plot it out here. You will find that every single data point is captured on that graph, and I have only two levels: level zero and level one. That’s all there is. If you try to fit these data points with a simple linear model like that, you will find that your model fails dramatically, because the data is really a classification problem. You can’t just go ahead and fit a straight line like that; the performance would be really poor. That’s why we have to shift from basic linear regression models to what we call logistic regression models and try to capture that non-linearity.

So let’s take a look at fitting a logistic regression model instead of a simple linear regression. We have our data again. Here we tried to fit the linear model, and it’s not going to work, so now we need to fit a logistic regression model that looks like that. If you do this, it works much better, because now the logistic regression curve saturates. It ranges between zero and one, and there are some values in between, maybe 0.1, 0.2, 0.6, 0.8. So I have different probability values in between, but the curve saturates at zero and one, and that’s what I’m looking for, because again I want to perform a classification task. All right, so let’s take a look at some math for logistic regression.
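As a rough illustration of the point above, the sketch below fits both a straight line and a logistic curve to the study-hours example. The eight data points are the ones read out in the video; the query points and the use of scikit-learn with default settings are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hours studied and pass (1) / fail (0) outcomes, as read out in the example.
hours = np.array([1.0, 1.5, 2.0, 3.0, 3.25, 4.0, 5.0, 6.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

linear = LinearRegression().fit(hours, passed)
logistic = LogisticRegression().fit(hours, passed)

# Hypothetical query points, including values outside the observed range.
query = np.array([[0.0], [3.0], [8.0]])

linear_out = linear.predict(query)                   # unbounded straight line
logistic_out = logistic.predict_proba(query)[:, 1]   # estimated probability of passing

print("linear  :", np.round(linear_out, 2))
print("logistic:", np.round(logistic_out, 2))
```

The straight line drops below 0 and climbs above 1 at the extreme query points, so its output cannot be read as a probability, while the logistic output always stays between 0 and 1.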
So as we mentioned, linear regression is not suitable for classification problems, and that’s why we apply logistic regression to do the classification for us. But if you take a look at the math behind logistic regression, you will find that it actually starts as a simple linear regression model. So here I have the equation for linear regression. Basically, it’s the equation of a straight line: the output y equals b0 plus b1 times x, that is, y = b0 + b1*x. What we do is apply a non-linear function called the sigmoid function, which we also call a sigmoid activation function, and that generates what we call the probability. So we say P(x) equals sigmoid of y, where y is represented by that straight-line equation, and the sigmoid is simply 1 over 1 plus e to the power of minus y: sigmoid(y) = 1 / (1 + e^(-y)). If you substitute for y, you come up with P(x) = 1 / (1 + e^(-(b0 + b1*x))). And if you plot that equation, you come up with this orange line here, which is the logistic regression curve, and it is well suited to our classification task.

All right, so the question now is: I still have a continuous output. I don’t have two classes. So what I can do instead is select a threshold, and that threshold might be 0.5, for example, so that values below 0.5 and values above 0.5 fall into two different classes.
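The formula and the thresholding step described here can be sketched in a few lines of plain Python. The coefficient values `B0` and `B1` are hypothetical, chosen only so that the numbers are easy to follow, and sending a probability of exactly 0.5 to class 1 is an assumption, since the video only covers values below and above the threshold.

```python
import math

def sigmoid(y: float) -> float:
    """Map any real number into the open interval (0, 1).

    Fine for moderate values of y; very large negative inputs would overflow.
    """
    return 1.0 / (1.0 + math.exp(-y))

def predict_probability(x: float, b0: float, b1: float) -> float:
    """P(x) = sigmoid(b0 + b1 * x): the straight line passed through the sigmoid."""
    return sigmoid(b0 + b1 * x)

def classify(x: float, b0: float, b1: float, threshold: float = 0.5) -> int:
    """Return class 1 if the probability reaches the threshold, else class 0."""
    return 1 if predict_probability(x, b0, b1) >= threshold else 0

# Hypothetical coefficients, not fitted to any data.
B0, B1 = -6.0, 2.0

for hours in (1.0, 3.0, 5.0):
    p = predict_probability(hours, B0, B1)
    print(f"x={hours}: probability={p:.3f}, class={classify(hours, B0, B1)}")
```

With these made-up coefficients the straight line crosses zero at x = 3, so that is where the probability is exactly 0.5 and the predicted class flips.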
So for example, any value that is less than 0.5 would be classified as class 0, and any value that is above 0.5 would be classified as class 1. And now I have a model that can do classification for me based on the probability of the output: after I generate the probability, I just apply a threshold that tells me whether I’m in class zero or class one. That’s all there is. All right, so that’s all I have for the intuition lecture. Let’s go ahead into the code and see how we can apply logistic regression to perform classification on an actual case study.

All right, so now we have the Jupyter notebook open here, and you will have a link to the Jupyter notebook as well, so you will be able to download it and run it on your machine too. So here is the problem statement: you have been hired as a consultant to a startup that is running targeted marketing ads on Facebook, and the company wants to analyze customer behavior by predicting which customers click on the advertisement, based on the data. We have the name of the customer, their email, the country, the time on Facebook, and the estimated salary as well. These are the inputs to the model, and the output is simply binary, one or zero. One indicates that the customer clicked on the ad, and zero indicates that the customer did not click on the ad.

Okay, so first we are going to import our libraries and dataset. We’re going to import pandas as pd; pandas is mainly used for data frame manipulation. We’re going to import numpy as np; NumPy is used for numerical analysis. And matplotlib and seaborn are primarily used for visualization and data plotting. If you press Shift+Enter, that should run the cell. You’ll find the cell here ran successfully. Looks good.

And here I have my data, which is a CSV file entitled Facebook Ads 2, and I’m just going to use pandas to read the CSV. So I’m going to say pd.read_csv, and that should load my data, and I’m going to put the data in a data frame called training_set. If I check out the training set, here we go: I have a bunch of names, a bunch of emails, the country, the time spent on site, the salary, and here I have the output, the prediction, which is simply either zero or one.

Please note that, for the sake of simplicity, we’re going to use only the most important data out of our data frame. So for example, the names I’m just going to drop; I don’t really need them. The emails I’m going to drop as well. I’m going to focus on two features only, which are time spent on site and salary, and the output will be the prediction of whether the customer clicked on the ad or not. If I check out the tail of the data frame, you will find that I have around 500 samples, give or take, and again the outputs here are either zero or one.

So let’s go ahead and explore our data and perform some data visualization. What I’m going to do is divide my data into two categories, the one category and the zero category, which is simply click or no click. So here I’m saying: go to my training set, which is my data frame, and look at the column Clicked, which is the output. If it equals one, that means the customer clicked on the ad; if it’s zero, that means the customer did not click on the ad. So it’s just going to divide them into click and no click. And here I want to explore and see whether I have a balanced dataset or not. So I can say: show me the length of clicked and the length of no click. I want to see both, and I want to see the percentage as well, which is how many customers out of my data frame clicked on the ad and how many did not. If you press Shift+Enter, here we go. You will find that I have an extremely balanced dataset, which is pretty unusual; in practice, you don’t get a perfect dataset when you deal with real-world data. So here the total is around 500 samples, the number of customers who clicked on the ad is 250, and the percentage is around 50.1 percent, and the rest who did not click on the ad is around 49.89 percent. Obviously, if you sum them up, you come up with 100 percent.

Let’s go ahead and use the seaborn scatter plot to plot my time spent on site versus my salary, and on the hue I’m going to show the two classes, which are the customers who clicked and the customers who did not click. If you press Shift+Enter, that’s what you get. Here are my two classes: class zero, which is the blue dots, and the orange here is class one. So it looks like this dataset is not perfectly linearly separable, which means I would need more of a non-linear boundary to separate these two classes completely. So let’s see how we’re going to deal with that in our code. What I’m going to do here is plot the box plot to show me clicked versus salary. Here we go. It looks like the people who clicked on the ad, which are class 1, have in general a higher salary compared to the customers who did not click on the ad.
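The class-balance check described above can be sketched as follows. Since the notebook’s CSV file is not reproduced here, the data frame below is synthetic: the column names (`Time Spent on Site`, `Salary`, `Clicked`), the sample counts, and the feature distributions are all assumptions chosen to resemble the figures mentioned in the video, not the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the ads data frame (assumed sizes and column names).
n_clicked, n_no_click = 250, 249
n = n_clicked + n_no_click
training_set = pd.DataFrame({
    "Time Spent on Site": rng.normal(33, 9, n),
    "Salary": rng.normal(52_000, 18_000, n),
    "Clicked": np.r_[np.ones(n_clicked, dtype=int), np.zeros(n_no_click, dtype=int)],
})

# Split the rows by the value of the output column.
clicked = training_set[training_set["Clicked"] == 1]
no_click = training_set[training_set["Clicked"] == 0]

total = len(training_set)
pct_clicked = 100.0 * len(clicked) / total
pct_no_click = 100.0 * len(no_click) / total

print("Total =", total)
print("Clicked =", len(clicked), f"({pct_clicked:.2f}%)")
print("Did not click =", len(no_click), f"({pct_no_click:.2f}%)")
```

On a data frame like this, the plots from the walkthrough correspond to seaborn calls such as `sns.scatterplot(x="Time Spent on Site", y="Salary", hue="Clicked", data=training_set)` and `sns.boxplot(x="Clicked", y="Salary", data=training_set)`.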
If I do the same with the time spent on site, you will find that the people who clicked on the ad have on average a higher time spent on site compared to the blue class. If you plot the histogram for the salary, here we go, the salary distribution: the mean is around 50,000, and it ranges between 0 and 100,000. And if you want to check the time spent on site, that’s what you get: a distribution that averages between 30 and 40 minutes, approximately.

All right, so let’s go ahead to step number three and prepare our data for training. Let’s check out our training set again. That’s our data, and what I’m going to do here is drop the unnecessary columns. I’m going to get rid of the names and the emails. Country you can keep if you want to, but you have to convert it into dummy variables to feed it to the machine learning model. For the sake of simplicity, I’m just going to drop the country as well. So if you press Shift+Enter, now it’s gone. Check out the training set: I ended up with time spent on site, salary, and the output, which is clicked or not.

So now I’m pretty much ready to divide my data into training and testing. First I’m going to allocate my data: all the inputs we’re going to call uppercase X, and all the outputs we’re going to call lowercase y. So here I’m saying: go to my training set and drop Clicked, and that will be X; and the column Clicked will be the output, my target variable y. Shift+Enter. Looks good. Now I can scale my data using StandardScaler from sklearn.preprocessing. If you press Shift+Enter, looks good.

And now I can use scikit-learn’s train_test_split to divide my data into training and testing. So again, I’m going to import the library here, and here we use train_test_split, feeding in my X and y. Please note that the test size is set to 0.2, which means I’m going to use 20 percent of my data for testing and 80 percent for training. Check out the X_train shape: that’s my X_train, two columns. That’s my y_train, which is one column.

And now I can go ahead and train a logistic regression classifier model. scikit-learn makes it very easy. You don’t need to worry about the actual implementation and the math behind it; just two or three lines of code and you’re good to go. So we’re going to say from sklearn.linear_model import LogisticRegression. Here I imported the class, then I instantiate an object out of my class, and then I apply the fit method to my object, feeding in my training data, which is X_train and y_train. You press Shift+Enter, and here we go: the model is trained.

And now in step number five, we are going to test our model. What I’m going to do here is take my classifier model and apply predict_proba to it. If you press Shift+Enter, that gives me the probabilities from my model. So what we have done here with classifier.predict_proba is get the actual probabilities generated by my model. But if I say classifier.predict, that gives me the actual classes, either zero or one. So now I’ve got a bunch of zeros and ones, and we have done that on the training data. What I need to do is do that on the testing data as well.
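The preparation, training, and prediction steps described above can be sketched end to end as follows. The data is synthetic, and the column names, random seeds, and the way the labels are generated are assumptions made only so the example runs on its own; the scikit-learn calls themselves (`StandardScaler`, `train_test_split`, `LogisticRegression`, `predict_proba`, `predict`) are the ones named in the walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500

# Synthetic features; higher time on site and salary make a click more likely.
time_on_site = rng.normal(33, 9, n)
salary = rng.normal(52_000, 18_000, n)
score = 0.15 * (time_on_site - 33) + 0.0001 * (salary - 52_000)
training_set = pd.DataFrame({
    "Time Spent on Site": time_on_site,
    "Salary": salary,
    "Clicked": (score + rng.normal(0, 1, n) > 0).astype(int),
})

# Inputs are every column except the target; the target is the Clicked column.
X = training_set.drop("Clicked", axis=1).values
y = training_set["Clicked"].values

# Scale the features (done before the split, following the order in the video).
X = StandardScaler().fit_transform(X)

# Hold out 20 percent of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

probabilities = classifier.predict_proba(X_test)  # one column per class
y_predict_test = classifier.predict(X_test)       # hard 0/1 labels

print(X_train.shape, X_test.shape)
print(probabilities[:3])
print(y_predict_test[:10])
```

One design note: fitting the scaler on the training split only and then applying it to the test split is a common alternative that keeps test-set statistics out of training; the sketch keeps the order used in the video.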
So here I’m going to run that to check out y_train, and then I’m going to say classifier.predict and feed it X_test. That gives me my y_predict_test, and from there I can plot the confusion matrix. So what I can do afterwards is visualize the results by plotting the confusion matrix, which is just a matrix that shows us the performance of our classifier model. So I can say from sklearn.metrics import confusion_matrix, and then call confusion_matrix, feeding it my y_test, which is my ground truth, and my y_predict_test, which is the model predictions. And if you use seaborn’s heatmap, sns.heatmap, that plots the confusion matrix. Here I have 43 samples that have been correctly classified; these are the true positives. I have 43 samples that have been correctly classified; these are the true negatives. Here I have five and nine; these are the samples that have been misclassified by the model. I would say that’s pretty good overall, given that we didn’t do any hyperparameter tuning or any of that. We just used scikit-learn, and it was very straightforward.

I can also print the classification report. So I can say from sklearn.metrics import classification_report, and then print the classification report. You’ll find that we have reached an avg/total of around 0.86, which is not bad: precision of 0.83 on class 0 and 0.9 on class 1, recall of 0.9 and 0.83, and an F1 score, which is the harmonic mean of precision and recall, of 0.86 and 0.86, which again is not bad.

The last step is to visualize the training and testing datasets. I want to plot my original data points, and I want to plot as well the boundary line that separates the two classes, and I want to see this visually.
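To make the link between the confusion matrix and the classification report concrete, here is a small sketch that computes precision, recall, and F1 by hand. The two counts of 43 and the two error counts of 5 and 9 come from the video; which off-diagonal cell holds the 5 and which holds the 9 is an assumption, chosen because it matches the quoted precision of 0.83 for class 0.

```python
# Rows are actual classes, columns are predicted classes.
# Assumed layout: 5 actual-0 samples predicted as 1, 9 actual-1 samples predicted as 0.
confusion = [
    [43, 5],
    [9, 43],
]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class k of a square confusion matrix."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp   # predicted k, but actually another class
    fn = sum(cm[k]) - tp                  # actually k, but predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

print(f"accuracy = {accuracy:.2f}")
for k in (0, 1):
    precision, recall, f1 = per_class_metrics(confusion, k)
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under that assumed layout, the hand-computed values round to an accuracy of 0.86, precision of 0.83 and 0.90, recall of 0.90 and 0.83, and an F1 score of 0.86 for both classes.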
I just want to see whether my model classified the two classes correctly or not. So first I can use a meshgrid here, and that meshgrid will create a mesh over my data so that afterwards I’m able to plot the boundary line. So if you run this cell, you can check out the y_train shape and the X_train shape, and the X1 shape as well. What we’re doing here is plotting our trained classifier; I just want to see the boundary line between the two classes. If you press Shift+Enter, that’s what you get. Here I have my two classes, the blue class and the pink class, and that’s my straight line in here. What I can do now is plot the actual dots, which is the ground truth, the training data that I had, on top of this. So here I plotted only the boundary, here I’m plotting only the training points, and here I’m going to put them on top of each other. So let’s go ahead and plot that. Shift+Enter, here we go: that’s again my original data points. So we plot this and this on top of each other; that’s what we’re doing here. If you press Shift+Enter, here we go: that’s my boundary and all the points. All these points have been correctly classified. All the blue points on the pink side have been misclassified, and vice versa: all the pink points on the blue side have been misclassified as well.

You can do the same for the testing data. So that’s again my results, and these are the samples that have been wrongly classified, and the blue dots here are samples that have been misclassified as well: classified as class zero, but they belonged to class one. So that’s it. That’s all I have for this case study.
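The meshgrid step can be sketched as follows. The two-feature training data here is synthetic and stands in for the scaled `X_train` and `y_train` from the notebook, and the grid step and margins are arbitrary choices; the idea is only to show how predictions over a grid reveal the straight boundary line of a logistic regression model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic, already-scaled stand-in for the notebook's X_train and y_train.
X_train = rng.normal(0, 1, size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

classifier = LogisticRegression().fit(X_train, y_train)

# Build a grid that covers the data with a margin of 1 on every side.
step = 0.05
X1, X2 = np.meshgrid(
    np.arange(X_train[:, 0].min() - 1, X_train[:, 0].max() + 1, step),
    np.arange(X_train[:, 1].min() - 1, X_train[:, 1].max() + 1, step),
)

# Predict a class for every grid point, then reshape back to the grid layout.
grid_points = np.column_stack([X1.ravel(), X2.ravel()])
Z = classifier.predict(grid_points).reshape(X1.shape)

# The boundary is the line where b0 + w1*x1 + w2*x2 = 0, i.e. probability 0.5.
w1, w2 = classifier.coef_[0]
b0 = classifier.intercept_[0]
on_boundary = np.array([[0.0, -b0 / w2]])
print("probability on the boundary:", classifier.predict_proba(on_boundary)[0, 1])
```

With matplotlib, `plt.contourf(X1, X2, Z, alpha=0.3)` would shade the two predicted regions, and `plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)` would draw the training points on top, which is the overlay described in the video.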
I hope you guys enjoyed it. If you like this video, please hit like and subscribe to my channel for more videos.