So we’re now going to move on to building logistic regression models in SAS and other packages like R, SPSS, and so on. We’re going to focus more on SAS in this set of videos and then quickly go through the other packages as well. Now, at a very high level, these are the steps that we undertake while building a logistic regression model. Needless to say, the first step, even before we do the sampling, is to create the data set: code our dependent variable into binary, a zero and one, then create our independent variable list, and create a final data set on which we will do the modeling. The step that we do after that is a random split of the data set into training and validation data sets. We do this so that we can test against overfitting. The idea is that we create the model on the training data set, which is a random sample, and then we apply the model on the validation data set and see how well the model is able to fit the dependent variable. So that’s the objective. The first step, then, is to sample the data. After that we start the model creation process, and there are multiple steps in it. The first step is to check for collinearity, or correlations, among the independent variables. Now, logistic regression is a special form of regression, so all the assumptions of regression are relevant here as well: we need to make sure none of the variables are collinear, we need to avoid variables which have high outlier values, and so on. Once we do the collinearity filtration, we move on to variable selection techniques, i.e., deciding which variables are actually significant for our model. This can be done manually using expert knowledge, or you can use techniques like stepwise, backward, and forward selection. And finally, we finalize the model that we’re going to work with.
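The transcript describes this collinearity screen conceptually; the video itself uses SAS. As a rough, language-neutral sketch, the check amounts to computing pairwise correlations among the candidate independent variables and flagging highly correlated pairs. The variable names and threshold below are hypothetical, not from the case study.

```python
# Sketch of the collinearity screen described above: compute pairwise
# Pearson correlations among candidate independent variables and flag
# pairs whose absolute correlation exceeds a chosen threshold.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def flag_collinear(variables, threshold=0.8):
    """Return (name_a, name_b, r) for variable pairs with |r| > threshold."""
    names = list(variables)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(variables[a], variables[b])
            if abs(r) > threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Toy example: 'tenure_months' and 'total_billed' move almost in lockstep,
# so the pair gets flagged; 'age' is only weakly related to either.
data = {
    "tenure_months": [1, 5, 9, 14, 20, 26],
    "total_billed":  [40, 210, 380, 600, 850, 1100],
    "age":           [23, 51, 34, 45, 29, 62],
}
print(flag_collinear(data))
```

In practice you would keep only one variable from each flagged pair (or use a variance inflation factor check), since collinear predictors destabilize the regression coefficients.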
Once we’ve finalized our model, we move on to the testing part. We built the model on the training data, and now we start testing it on the validation data set. There are lots of tests or statistics that we can generate which tell us how good or bad the model is, and some of the topics we will cover are lift charts, capture rates, Lorenz curves, the ROC curve, the Gini statistic, and the KS statistic. Finally, we settle on a model that we feel is good enough, and the last step, of course, is to deploy the model in the field. So in the next few videos, we will cover all of these steps broadly. Since this is a training series, we’re not going to go too deep into any of these steps; this is just a familiarization exercise. So let’s start with the first step: dividing the data into two parts. We’re going to cover all of these steps using an actual data set, a telecom churn case study. In this data set we have about a thousand observations, and each observation is essentially one subscriber. So what we’ve got is typical information for the subscribers of a telecom company. We have a mix of demographic variables, like age, the region they live in, marital status, educational status, gender, and so on. Then we have certain behavioral variables, for example: what kind of product have they taken up? What are the charges? How are they using the product? What’s their average billing? What number of lines are they using? And so on. Finally, we have the churn variable, which is a binary variable indicating whether or not they churned within the last one month.
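One of the validation statistics listed above, the KS (Kolmogorov-Smirnov) statistic, is simple enough to sketch directly: it is the maximum gap between the cumulative share of churners and the cumulative share of non-churners as you walk down the score-ranked validation set. The scores and labels below are made-up illustrative values, not the case-study data.

```python
# Sketch of the KS statistic: rank rows by predicted score (highest first),
# then track the largest separation between the cumulative event and
# non-event distributions. Bigger KS = better separation by the model.

def ks_statistic(scores, labels):
    """labels: 1 = event (churn), 0 = non-event. Higher score = more likely event."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total_events = sum(labels)
    total_nonevents = len(labels) - total_events
    cum_e = cum_n = 0
    best = 0.0
    for _score, label in pairs:
        if label == 1:
            cum_e += 1
        else:
            cum_n += 1
        gap = abs(cum_e / total_events - cum_n / total_nonevents)
        best = max(best, gap)
    return best

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0,   1,   0]
print(ks_statistic(scores, labels))  # → 0.5
```

A model that ranked every churner above every non-churner would score 1.0; random scoring drifts toward 0.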
So the objective of this case study is that, using a mix of both demographic and behavioral (transactional) variables, we want to build a model which is able to predict whether a customer will churn or not. The ability to predict customer churn is very, very essential for telecom companies. If they build a good model, they can actually try to prevent churn: once they know who is going to churn, they can try various retention techniques on those customers. Now, this data set I’ve already imported into SAS. We have about a thousand observations and a variety of variables. Some of them are numeric variables and some are categorical variables. For example, region, even though it’s coded as a number (one, two, three), is not actually a number; it’s simply a placeholder. Similarly, education tells us the type of education the customer has. It’s coded as one, two, three, four, five, but these are not actual numbers; for example, two plus one does not give us three here. They are just labels, like high school, degree, college, etcetera, coded as one, two, three, four. These are actually categorical variables. Finally, some of the variables are already coded in dummy format, that is, zero and one format. The idea, as covered before in our lecture series, is that if you want to use categorical variables, they have to be converted into dummy-coded variables, and we will cover how to do that automatically in logistic regression. So, that said, this is our data set. Let’s run the library and check it out. If I look into my library, I already have the telecom data set loaded. In case you are a paid subscriber, you should be able to access and use this data set directly; otherwise, this telecom data set is freely available on the Internet, so you can download it from there and then import it into SAS. Now, some of these variables are actually log transformations.
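The dummy coding described above can be sketched in a few lines: a categorical variable such as region (coded 1/2/3 but really just labels) is expanded into 0/1 indicator columns, with one level dropped as the reference category. This is a generic illustration in Python, not the SAS mechanism the video uses; the column values are made-up.

```python
# Sketch of dummy (indicator) coding for a categorical variable.
# One level is omitted as the reference to avoid perfect collinearity
# among the indicator columns.

def dummy_code(values, reference=None):
    """Expand category labels into {column_name: [0/1, ...]} indicator columns,
    omitting the reference level."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]   # default: first level is the reference
    return {
        f"level_{lvl}": [1 if v == lvl else 0 for v in values]
        for lvl in levels
        if lvl != reference
    }

region = [1, 3, 2, 2, 1, 3]
print(dummy_code(region))
# region=1 is the reference; rows with region 1 get 0 in both columns
```

In SAS, PROC LOGISTIC can do this expansion for you via the CLASS statement, which is the "automatic" handling the transcript refers to.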
So if you look over here, you have the log transformations of the previous variables. Log transformations can be helpful, but in a lot of cases, what I see over here is that the log transformations are in fact not calculated, because you have problems like log of zero and so on. So a log transformation may or may not be useful in all cases; in our case study, I’m actually going to ignore these variables and work with the others. So the first step is that we’re going to sample the data. We’re going to split the data randomly between training and validation, and this needs to be a simple random sample, so we’re going to use a random number generator. A typical split between training and validation can be 50/50, 60/40, or 70/30, depending on the availability of data and computational power. Typically, we want to have as many rows as possible to build the model on, but sometimes we don’t have enough data. Since this is a test data set with only a thousand observations, I’m going to do a 60/40 split. To do the 60/40 split, I am using a simple DATA step in SAS and generating a random number. The random number generator gives a decimal number between 0 and 1; if that number is less than or equal to 0.6, we output the row into the training data set, else we output it into validation. In case you want to do a 50/50 split, you can change this number to 0.5; for a 70/30 split, you can change it to 0.7. We are going to go ahead with the 60/40 split, so let’s run this step and check the log. There we have it, it’s executed perfectly. And if we look in the work directory, we should have two data sets created, and indeed we have the training and validation data sets. The split should put about 60% into training: out of the thousand observations, we see about 572 in training, so about 57.2%, which is fine, and validation has about 428 observations. So that’s the first step that we’ve just done.
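The video performs this split with a SAS DATA step (random number ≤ 0.6 goes to train, else validation). As a roughly equivalent sketch in Python, under the assumption of a fixed seed for reproducibility; the integer rows below are a stand-in for the telecom subscriber records:

```python
# Equivalent of the SAS DATA-step split described above: draw a uniform
# random number per row and route the row to train or validation.
import random

def split_train_validation(rows, train_fraction=0.6, seed=12345):
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    train, validation = [], []
    for row in rows:
        if rng.random() <= train_fraction:
            train.append(row)
        else:
            validation.append(row)
    return train, validation

rows = list(range(1000))        # stand-in for 1,000 subscriber records
train, validation = split_train_validation(rows)
print(len(train), len(validation))  # roughly 600 / 400, not exactly
```

As in the transcript, the counts only approximate 600/400 because each row is routed independently; with only a thousand rows, a deviation like 572/428 is entirely normal.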
We’ve created our training data set and our validation data set. Let’s move on to the next step.