Transcript:

[MUSIC] We are going to use these four libraries. Let’s read the data file. It’s called binary.csv. If you look at the structure of the data set, it has 400 observations and four columns. admit is the response variable: whether a student was admitted or not. Then we have GRE score, GPA, and the rank of the university the student came from.
First we will do a cross tabulation for admit and rank; our data set is called data. Ideally, these frequencies should all be more than five, so we are okay here. Let’s also convert the variables admit and rank to factors, because they are not really integers. Now, if you run the structure line again, you can see that admit and rank have been converted to factor variables.
Next we are going to use pairs.panels, and I’m going to remove the first variable. For developing a naive Bayes classification model, we need to make sure that the independent variables are not highly correlated. So let’s plot this. The only numeric variables are GRE and GPA, and the correlation between them is not that strong: the correlation coefficient is only 0.38.
Now let’s create a box plot with ggplot. We define the aesthetics with aes, where the x-axis has the variable admit and the y-axis has gre, and we fill colors based on admit. Then we use the plus sign to add the next line of code, geom_boxplot, and we can also give a title with ggtitle. The red box is where the student is not admitted, and that’s the GRE spread, whereas this one is for admit equals one, where students are admitted. There is a significant amount of overlap, but overall the average GRE is higher for admitted students than for students who were not admitted. So this plot shows there is some potential to develop a classification model, but that model is unlikely to be 100% accurate because there is a lot of overlap between the two.
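The preparation steps narrated above can be sketched roughly as follows. This is a minimal illustration, not the code shown in the video: the data frame is a synthetic stand-in with the same four column names (the real binary.csv is not reproduced here), the plot title is a placeholder, and the ggplot2 part runs only if that package is installed.

```r
# Synthetic stand-in for binary.csv (assumption: columns admit, gre, gpa, rank).
# With the real file this would be: data <- read.csv("binary.csv", header = TRUE)
set.seed(1)
n <- 400
data <- data.frame(
  admit = rbinom(n, 1, 0.3),
  gre   = round(rnorm(n, 580, 115)),
  gpa   = round(runif(n, 2.3, 4.0), 2),
  rank  = sample(1:4, n, replace = TRUE)
)

# Cross-tabulate the response against rank; cell counts should ideally exceed 5.
tab <- xtabs(~ admit + rank, data = data)
print(tab)

# admit and rank are categories, not quantities, so store them as factors.
data$admit <- as.factor(data$admit)
data$rank  <- as.factor(data$rank)
str(data)

# Correlation between the two numeric predictors.
r <- cor(data$gre, data$gpa)
print(r)

# Box plot of GRE by admission status (drawn only if ggplot2 is available).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data, aes(x = admit, y = gre, fill = admit)) +
    geom_boxplot() +
    ggtitle("Box Plot")  # placeholder title
  print(p)
}
```

Because the data here are simulated, the counts and the correlation will not match the 0.38 mentioned in the transcript.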
If we do the same thing for GPA, we get this plot. Let’s also do a density plot. First we’ll do it for GRE on the x-axis, we fill color using the admit variable, and we use geom_density. Let’s also use an alpha of 0.8, which controls how transparent the graph is, and let’s use black as the color. I’m also going to put a title here. In this density plot, the curve where admit is one sits further to the right: these students have higher GRE scores than the red ones, where students were not admitted. But now you can clearly see that there is a significant amount of overlap between the two. Similarly, if you do this for GPA, you get this picture here.
Next we split the data. We use a random seed, set.seed of 1234. We sample two groups, the number of draws is the number of rows we have in our data, and this is with replacement, so replace equals T for true. We’ll use 80% for training and 20% for testing. The training data is the rows where ind equals 1 and, similarly, the test data is the rows where ind equals 2. Our test data has 75 observations out of 400, and the training data set has 325.
The naive Bayes algorithm is based on Bayes’ theorem. The equation gives P(A | B), the probability of event A given that event B occurs, as the probability of event A times the probability of B given A, divided by the probability of event B: P(A | B) = P(A) × P(B | A) / P(B). Note that the naive assumption made when applying this is that the independent variables are independent of one another within each class. Let’s consider an example where students are applying for a graduate program in engineering. We can ask for the probability that a student is admitted given that the student is coming from a rank 1 school. Using Bayes’ theorem, this starts with the probability that the student is admitted, which is a simple calculation: in the training data set, we look at how many students have been admitted divided by how many students applied. This is multiplied by the probability that the student is coming from a rank 1 school given that the student is admitted, and then divided by the probability that the student is coming from a rank 1 school.
So let’s store the model in model. The function is naive_bayes. admit is our response variable, so admit as a function of all other variables, for which I put a dot, and for developing the model we use the train data. Now the model is created; let’s look at it. In the training data, about 68.6% of the data points belong to the category where admit is zero, and 31.3% where admit equals one. So in the training data, 31.3% of the students were admitted to the program. Then we have three other tables. For each quantitative variable in the data set, like GRE or GPA, it calculates the mean and standard deviation. Please note that with a normal distribution, once we know the mean and standard deviation, we can calculate any probability. So in this analysis, whenever an independent variable is numeric, you will see its mean and standard deviation, and for a categorical variable you get the probabilities. For example, rank is a categorical variable: the probability that a student applied from a rank 1 school, given that the student was not admitted, is 0.103. Similarly, the probability that a student applied from a rank 1 school, given that the student is admitted, so admit equals one, is 0.245. These probabilities are given in the table. So the data we use is train, and I connect this to the next line, where I filter the cases where admit equals zero.
So let me put this in quotes, and then we connect this to summarize: the mean of GRE and the standard deviation of GRE. If you run this, we get 578.6, and that’s the same number as what the model shows for admit equals zero. Similarly, the standard deviation is 116.325, which is the value when admit is zero. And if you run the same thing with one, you get the other two numbers.
We can also plot the model. If you run this line, we actually get three plots. This is the third one, so let me go back: this is the second and this is the first one, a density plot for GRE. We made a density plot earlier, and that plot looked slightly better. The next graph is for GPA, and then we get this plot for the categorical data: rank is categorical and has four values, one, two, three and four. Green means students getting admitted and red means students not admitted to the program. You can clearly see that the chance of getting admitted is higher when a student comes from a rank 1 school, whereas a student coming from a rank 4 school still has a chance, but it’s much smaller.
Let’s store predictions in p. We use the model and the train data, and the type of prediction is probability, so this will include all the probabilities. We can take a look at some of these probabilities alongside the original data to see what’s going on. We look at the first few rows, and I’m going to use cbind, for column bind, on p and the training data, which is called train. Let’s run this. The first applicant has a probability of 0.84, so roughly an 84% chance that this student will not be admitted, and in fact this student was not admitted. This student had a low GRE score, you can see 380, and the GPA was 3.61, but this student came from a low-ranked university or college.
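To make it clearer where such predicted probabilities come from, here is a hand-computed sketch in base R of how a Gaussian naive Bayes model combines the class proportions, the per-class means and standard deviations, and the per-class rank probabilities. It is an illustration under stated assumptions, not the naivebayes package’s own code: the data are synthetic, `posterior` is a hypothetical helper, no smoothing is applied, and the applicant’s rank of 3 is assumed (the transcript only says the school was low ranked).

```r
# Synthetic stand-in for the admissions data (assumption, not the real file).
set.seed(1234)
n <- 400
admit <- rbinom(n, 1, 0.3)
data <- data.frame(
  admit = factor(admit),
  gre   = rnorm(n, mean = 570 + 50 * admit, sd = 115),
  gpa   = rnorm(n, mean = 3.3 + 0.2 * admit, sd = 0.35),
  rank  = factor(sample(1:4, n, replace = TRUE))
)

# Roughly 80/20 split, as described in the transcript.
ind   <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train <- data[ind == 1, ]

# The tables a Gaussian naive Bayes model keeps for each class.
prior    <- prop.table(table(train$admit))                 # P(admit)
gre_mean <- tapply(train$gre, train$admit, mean)
gre_sd   <- tapply(train$gre, train$admit, sd)
gpa_mean <- tapply(train$gpa, train$admit, mean)
gpa_sd   <- tapply(train$gpa, train$admit, sd)
rank_tab <- prop.table(table(train$rank, train$admit), 2)  # P(rank | admit)

# Posterior for one applicant: prior times the per-variable likelihoods,
# normalised so the two class probabilities sum to one.
# All vectors are ordered by class level ("0", then "1").
posterior <- function(gre, gpa, rank) {
  joint <- as.vector(prior) *
    dnorm(gre, as.vector(gre_mean), as.vector(gre_sd)) *
    dnorm(gpa, as.vector(gpa_mean), as.vector(gpa_sd)) *
    as.vector(rank_tab[as.character(rank), ])
  setNames(joint / sum(joint), names(prior))
}

# GRE and GPA of the first applicant from the transcript; rank 3 is assumed.
p_first <- posterior(gre = 380, gpa = 3.61, rank = 3)
print(round(p_first, 3))
```

Since the training rows are simulated, the result will not reproduce the 0.84 quoted in the transcript; the point is only the structure of the calculation.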
Now, if you look at the second applicant, there is a 62% chance of not getting admitted, but this student was in fact admitted, probably because the GRE scores are better.
Let’s store predictions in p1, and then we’ll create a confusion matrix. Let’s call it tab1, and I’m going to put the whole thing in parentheses so that when I run this line, it also prints the result. You can see this is the confusion matrix: 196 students were correctly predicted not to be admitted, and there were 33 correct predictions for students getting admitted. For misclassification, we compute one minus the proportion of correct predictions, so misclassification is about 29.5%, very close to 30%.
Let’s repeat this for the test data. We’ll store predictions in p2, the data is test, the confusion matrix goes in tab2, and then we use tab2 for the misclassification. If I run these three lines, we get a repeat of what we did earlier: misclassification is about 32%.
To improve these misclassification rates, one thing we can try while developing the model is usekernel equals capital T for true. Now if we rerun this, the training misclassification, which was 29.5% earlier, has come down to 27.3%. That means accuracy has improved slightly. With the test data, earlier it was 32%, and now it is about 30.7%. [Music]
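The misclassification arithmetic can be checked with a short base R sketch. The first calculation uses only the training figures quoted in the transcript (196 and 33 correct predictions out of 325 rows). The second shows the same calculation from a confusion matrix, using made-up label vectors that are purely illustrative and are not the video’s predictions.

```r
# Training misclassification from the figures quoted in the transcript:
# 196 + 33 correct predictions out of 325 training rows.
train_error <- 1 - (196 + 33) / 325
print(round(train_error, 3))  # about 0.295, i.e. 29.5%

# The same calculation from a confusion matrix, using made-up labels
# (assumption: illustrative vectors only).
actual    <- factor(c(0, 0, 0, 1, 1, 0, 1, 0, 1, 0))
predicted <- factor(c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0))
tab <- table(Predicted = predicted, Actual = actual)
print(tab)

# Correct predictions sit on the diagonal of the table.
error_rate <- 1 - sum(diag(tab)) / sum(tab)
print(error_rate)  # 3 of the 10 made-up labels disagree
```

The same expression applies to either the training or the test table, so comparing the two error rates only requires swapping in the other set of predictions and labels.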