Transcript:

The goal of this tutorial is to solve a simple classification problem using logistic regression. If you followed my previous tutorial, we have learnt a lot about linear regression, especially the home prices. Example linear regression can be used to predict other things such as weather and stock prices. And in all these examples. The prediction value is continuous. There are other type of problems such as predicting weather, email, spam or not whether the customer will buy the life insurance product or person is going to vote for which party all these problems. If you think about it, The prediction value is categorical, because the thing that you are trying to predict is one of the available categories in the first two example, it is simple. Yes, or no answer in the third examples. It is one of the available categories, whereas in case of linear regression, the home prices example. We saw that the predicted value could be any number. It is not one of the defined categories, Okay, hence, this second type of problems is called classification problem and logistic regression is a technique that is used to solve these classification problems now in the classification examples that we saw there are two types, so the first example was predicting whether customer will buy insurance or not here. The outcome is simple, yes or no? This is called binary classification on the other hand. When you have more than two categories, that example is called multi class classification. Let’s say you are working as a data scientist in a life insurance company and your boss gives you a task of predicting. How likely a potential customer is to buy your insurance product and what you are seeing here is the available data and based on the age. The information you have is whether customer bought the insurance or not. Now here you can see some patterns such as young people don’t buy the insurance too much. You can see like there are persons with 20 to 25 These kind of ages where zero means they didn’t buy the insurance, whereas as the person’s age increases, he’s more likely to buy the insurance, so you know the relationship, and you want to build a machine learning model that can do a prediction based on the age of a potential customer so as a data scientist. Now this is the job you have been given now. The first thing you would do when you have this data is you will plot a scatter plot, which looks like this when you have walked on linear regression problems already, the first temptation you have in your mind Is you start using linear regression so when you draw our linear equation line using the linear regression? It will look something like this now. How did we come up with this line for that? You can follow my previous linear regression tutorials. If you think about it, what I can do here is I can predict the value using a linear equation line and say that if my predictive value is more than 0.5 so here, this is 0.5 if it is more than 0.5 Then I will say OK? Customer is likely to buy the insurance if it is less than that, then he is not going to buy the insurance. So anything on the right hand side is yes. Anything on the left? Hand side is no now, Of course we have these outliers, but we don’t care about them too much because for 90% of the cases of our linear regression will work. Ok, now imagine you have a data point, which is far on the right-hand side here, so say, a customer whose age is more than 80 years. Let’s say he bought your insurance. OK, then your scatterplot will look like this, and your linear equation might look like this. In this case, what will happen is when I draw a separation between between the two sections using 0.5 value here. The problem arises with these data points. Actually, the answer was yes here, but my question predicted them to be no. So you can see that this is pretty bad when you use linear regression for data. A class like this now here is the most interesting part. Imagine you can draw a line like this. This is much better fit compared to the previous linear equation that we had okay and here when you draw a separation between Z Using 0.5 value, you can clearly say that this model works much better than the previous one. The question arises. What is this line exactly? And how do you come up with this, right. If you have learnt statistics, you might have heard about sigmoid or logit function. And that’s what this is okay now. The moment you hear this term sigmoid. You might pause this video and start Googling about sigmoid, and it is fine. You can read all the articles about. Sigma or logic, logi’t function to get your understanding correct on mathematics behind it. But if you don’t want to do it. I will give you a basic idea. The Sigma functions equation is 1 divided by 1 plus e raised to minus Z, where E is some mathematical constant is called Europe Euler’s numbers. The value is this now. Think about this equation for a moment what we are doing here is? We are dividing by 1 by a number, which is slightly greater than 1 and when you have this situation, the outcome will be less than 1 correct, so all you are doing with this. My function is coming up with a range, which is between zero to one. So if you feed set of numbers to the sigmoid functions, all it will do is convert them to zero to one range and the equation that you will get. It looks like S-shape right, So if you plot a 2d plot 2d chart, it will look like a shape function that we saw in the previous slide. Essentially what we are doing with logistic regression is we have a line like this, which is linear equation, and you know the equation for our linear line, Which is MX Plus B. All you’re doing is you are feeding this line into a sigmoid function. And when you do that, you convert this line into this a shape. OK, so here you can see that my. Z, I replace with MX Plus B. So I applied sigmoid function on top of my linear equation. And that’s how I got my has shaped line here all right now. All of this math is just for your understanding. As a next step. We are going to write logistic regression using SQL on library and these details are abstracted for you, so don’t worry about it. You don’t have to implement all of this mathematics. You will just make a one simple call and it will work magically for you, all right, so let’s get straight into writing The code. Here is the CSV file containing the insurance data you can see. There are two columns age and whether that person bought the insurance or not and we are going to import this into our panda’s data frame, so I have loaded my Jupiter notebook by running Jupiter notebook, Come on on my command line, imported couple of important libraries, and then I imported the same CSV file into my data frame, which looks like this and now. I’m going to plot a scatterplot just to see the data distribution. And you can see that I get a plot like this here. These are the customers who didn’t buy the insurance. These are the ones who bought the insurance and you can see that. If the person is younger, he’s less likely to buy the insurance and as the person gets older, he is more likely to buy the insurance. The first thing now we are going to do is use trained taste split method to split our data set. So if you look at our data, we have 27 rows, so we are going to split these rows into training set and test set again. I have a separate tutorial for how to do train and test split, so you can watch that it is basically from. Escalon model selection, you import train taste split method here. My X is DF age now. I am doing doing two brackets because the first parameter is X, which has to be multi-dimensional array. So I’m using. I’m just trying to derive a data frame here. And what insurance is why and I will say, what is my taste size? If you want to see the arguments, you can do shift tab and it will show you a help for this function. So I used this a lot. It is pretty useful, so let’s see so there is this taste underscore size parameter, so let’s use taste and just core size. We are going to do or less less a train size, right, so training size is 0.9 so 90% of the example we are using for training and 10% We will use for actually testing over model now. What do you get back as a result? So these are the things you get back as it is all so? I’m just going to copy from here and that’s it hit ctrl. Enter to run it, okay, so here. There’s some warning. Maybe they are asking us to use test size. Doesn’t matter. Okay, let’s look at our test. So what test is 18 23 and 40 so these are the three values we are going to perform or taste on when you look at our X train, these are these are the data samples we will use to train our model, all right, so let’s now import logistic regression. So from same linear model, you can import logistic regression logistic education. Eske lon alright. So we even have logistic regression class in ported and we are going to create an object of this class. We’ll call it a model and that model. Now we’ll do a training, remember in SQL, And whenever you are using this method fit, you are actually doing your training for your model. So X train invite train. This is what you use for your training when you execute this, This means your your model is trained now, and it is ready to make predictions, so what these three values we are making a prediction so I will do Model Dot product and X paste so here? What it is saying is 0 0 1 which means first two samples. It is saying these two customers are not going to buy your insurance and you can see that it’s kind of working because they have age of 18 and 23 year old, and we saw that as the ages the younger age people do not buy the insurance, whereas I think anything more than 27 28 to buy so here. The age is 40 so the answer was 1 Okay, if you want to look at the score score is nothing, but it is showing the accuracy of your model, right, so what you’re doing is you’re giving. X tests, invitees and here the score is 1 which means our model is perfect. Now this is happening because our data size is smaller. We have only 27 samples, but if you have more wider samples, then it will make mistakes in at least few samples, so your score will be less than 1 right because of the small size of our data set. The score is pretty high here. Another method to try is you can see that benefit by the way tab. It will show you all the possible functions that start with ready. Okay, so here you can also predict a probability. So when you predict a probability of X test, it will show you a probability of your data sample being in one class versus the other. The first class here is if customer will not buy the insurance, so for the age 18 and 23 you can see this point Six percent probability that they will not buy the insurance, whereas for the person with age forty, it is reverse. There is 0.6 percent poverty that he will buy the insurance and point thirty nine percent probability that he will not point thirty nine percent. Really, it’s really thirty nine percent that he will not buy the insurance. If you want to do one off, then you can just do model predict this six. You will buy the insurance. That’s why you had one, and if you had something like twenty five, he will not buy the insurance. That’s why you get zero. So this model that we built is working back pretty well with logistic regression. That’s all I had, and now is the time for exercise, So if you know about Cagle of website, this is the website that hosts different coding competitions and it has one of the more important feature, which is the data sets. So if you go to this data set section, you can download various data sets based on the type based on the file type or you can even search for data set. So if you want to do some. Titanic Titanic Data Analysis. You can search for that. Basically, you can just explore these data sets for exercises from this. I have downloaded this HR. Analytics data set where there is an analysis on the employee retention rate or employee attrition rate. If I open that CSV file here. It looks like this where based on the satisfaction level, there’s no number of projects or average monthly hours that person has worked on. You are trying to establish the correlation between those factors and whether person would leave the form or whether he would continue with the form. These kind of analytics are very important for HR Department because they want to retain the employees And if you can build a machine learning model for HR Department, then they can focus on specific areas so that employees don’t leave at the farm. So that’s what you’re going to do. You are a data scientist. You’re going to work for your HR department and give them a couple of things, so I have mentioned all of those things in the Jupiter notebook, which I have available in the video description below. So if you open that notebook, you will see all the code that we just went through in this tutorial. And at the end, you will find this exercise section, OK? So there is a link here. You download the data set. If you don’t want to download it the same level as this notebook, there is an exercise folder, so download the CSV from that, and you’re going to give answer on these five questions. OK, first one is out of all these parameters that we have you want to find out Which factors affect the employee retention by doing some exploratory data analysis. You will also plot bar charts showing the impact of employee’s, salaries and retention also plot the bar chart, showing the impact of department and employee retention and then using the factors that you figured in step one, you will build a logistic regression model and using the model. You are going to do some prediction. In the end, you will measure the accuracy of the model. Let’s do that exercise in the comments below. Let know your answers, and if you want to verify the answers then. I have a separate notebook at the same level in exercise folder, which has all the answers. But don’t look at the answers directly, okay. A good student is someone who tries to find the solution on his own, and then he looks at the answer. All right, that’s all we had. Thank you very much for watching. I’ll see you next.