Transcript:

Okay, now let's create some of the numeric-to-numeric statistics that we were just talking about. I'm going to call this "bivariate num num," and we're going to do all the analysis right here. First of all, let's do some basic stuff. I'm going to import NumPy as np, since we like NumPy for basic math. Then let's create a couple of lists. Let's make a height list and put in a few values: 60, 62, 65, 68, 70, and 74. And then let's make a weight list and set it equal to 140, 138, 150, 166, 190, and 250.

All right, let's start with a simple correlation between height and weight using NumPy: np.corrcoef, and we simply pass in the two lists. These can be two features, two variables, or a feature and a label, and they have to be the same length, so the same number of values in each of them. Let's go ahead and run that. There we go. It always gives us a correlation matrix back. Because we passed in two features, it's basically the equivalent of a table with height and weight down the side and height and weight across the top. The two ones are there because the correlation between height and itself is 1 and between weight and itself is 1. The correlation between height and weight is 0.92989, that long number there, and it appears in both the lower left and the upper right. Same thing on both sides.

Okay, thanks, NumPy. Let's grab and print just the correlation we're interested in. The first index, 0, says go inside the first row of this array, and then we go to index 1 to get to the actual correlation coefficient.
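The NumPy steps just described can be sketched as follows. This is a minimal sketch, assuming the second weight value is 138 (the audio is unclear at that point); the variable names are illustrative.

```python
import numpy as np

# Values read out in the walkthrough; 138 is an assumed reading of the second weight
height = [60, 62, 65, 68, 70, 74]
weight = [140, 138, 150, 166, 190, 250]

# np.corrcoef always returns a full correlation matrix (2 x 2 for two inputs)
corr_matrix = np.corrcoef(height, weight)
print(corr_matrix)

# Row 0, column 1 holds the height-weight correlation (same value as row 1, column 0)
r = corr_matrix[0][1]
print(r)
```

Indexing with `corr_matrix[0, 1]` is equivalent and is the more common NumPy spelling.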
After that, we get just that number. Now, often we won't be using NumPy but something like pandas, so let's import pandas as well. The advantage of NumPy is that it's often a bit faster than pandas, not always, but we already use pandas for so many other things, so let's go ahead and learn how to do it there, too. I'm going to say df equals, and pull in my dataset with pd.read_csv, passing in the CSV file. There we go. Then simply df.corr(). That looks through our dataset, grabs every numeric feature, and adds them all to the matrix here. See how it automatically ignored sex, region, and smoker. It gives us the same diagonal of ones, so we don't even consider the correlation between age and age or BMI and BMI. The diagonal just gives our eye a place to orient, that's all. Everywhere else we get the actual correlations, such as between age and BMI.

We're really just interested in these three right here: charges, our label, against the other three features. We don't care so much about the other correlations because they're between features. There are other reasons why we might care, which we'll come to later in the book, but we won't worry about that for now.

All right, so how about if we just want the correlation for an individual pair in pandas? df.charges.corr: we first refer to the series we want, and then pass in the other one, df.bmi, for example. There we go. It gives us the individual correlation, the same one you see right here in the matrix, just less rounded: 0.198341.

Okay, cool, we can get a correlation. But what we often also want along with the correlation is the p-value. To do that, let's go to a different package entirely, one that we probably haven't used yet in this book. It's called SciPy: from scipy import stats. We've already got pandas, so I won't import it again.
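A minimal sketch of the pandas calls described above. The small DataFrame here is a hypothetical stand-in with made-up rows, since the walkthrough loads its real data from a CSV file whose name is not given. One assumption worth flagging: recent pandas versions no longer skip non-numeric columns by default in `DataFrame.corr`, so the sketch passes `numeric_only=True` to get the behavior shown in the walkthrough.

```python
import pandas as pd

# Hypothetical stand-in rows; the walkthrough reads the real data with pd.read_csv
df = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 31, 46, 37],
    "sex": ["female", "male", "male", "male", "male", "female", "female", "female"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 25.7, 33.4, 27.7],
    "children": [0, 1, 3, 0, 0, 0, 1, 3],
    "smoker": ["yes", "no", "no", "no", "no", "no", "no", "no"],
    "region": ["sw", "se", "se", "nw", "nw", "se", "se", "nw"],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62, 8240.59, 7281.51],
})

# Correlation matrix over the numeric columns only
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix)

# Correlation for one pair: call .corr on one Series and pass in the other
pair_r = df["charges"].corr(df["bmi"])
print(pair_r)
```

The single-pair value matches the `charges`/`bmi` cell of the matrix, which is a quick way to check that both calls are measuring the same thing.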
I've also already got the DataFrame, so I won't pull that in again. Instead, I'm simply going to say corr equals, and from the stats package, pearsonr is the name of the function. We pass in two lists once again: df.charges and df.age. Okay, that's processed. I should print it out: corr. There we go. It gives us back a pair of values: a correlation and, along with it, a p-value for that correlation.

All right, interesting. Let's round both of those. To do that, I'm going to use a notation that you may or may not have seen before. Whenever we get back more than one value and they're always in the same order (in this case, correlation first, p-value second), we can set the result equal to two variables, like this. So r gets the first value and p gets the next value, because that's the order they come back in. Afterwards, let's simply print r and print p. Okay, cool, now we've got them separately.

Now I'm going to round r to four decimals, and the same thing with p. And p turned into zero. Why? Because the p-value is so small that if you go out four decimals, they're still all zeros. Do you remember what the number was, how far we had to go out? At 20 decimals it's still zeros. How about 25? Still zeros. Let's go to 27. No, it was something more than 27; I wasn't paying attention. 40. There we go. 29. Let's change this to 29. There we go: zeros out to about the 29th decimal place, then a five. That is a very small number.

So what does that mean? Loosely speaking, for this correlation of 0.299, it is extremely likely that we'd see a correlation again the next time we collect 1,300 more samples, if they're randomly selected from the same population that we got our first 1,300 from in this dataset. Strictly, the p-value is the probability of seeing a correlation at least this strong in a sample this size if there were really no correlation in the population, and here that probability is tiny.
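The unpacking and rounding steps can be sketched like this. To stay self-contained, the sketch reuses the short height and weight lists from earlier rather than the dataset, so the numbers differ from the ones read out above; 138 is again an assumed reading of the second weight.

```python
from scipy import stats

# Height/weight lists from earlier in the walkthrough (138 is an assumed reading)
height = [60, 62, 65, 68, 70, 74]
weight = [140, 138, 150, 166, 190, 250]

# pearsonr returns the correlation first and the p-value second,
# so the result can be unpacked straight into two variables
r, p = stats.pearsonr(height, weight)

print(round(r, 4))
print(round(p, 4))

# For very small p-values, rounding shows 0.0; scientific notation keeps the magnitude
print(f"{p:.2e}")
```

Scientific notation is one way to avoid guessing how many decimals to round to when a p-value is extremely small.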
Okay, well, let's take this a step further now and calculate that whole y = mx + b formula that we did previously. We're going to stay here in the SciPy package. I'm going to make a DataFrame for this output, so I'm going to call it corr_df and set it equal to pd.DataFrame, and let's put columns in here: "r" for the first one and "p" for the next one. Then let's loop through: for col in df. Now, we don't want to do this for every column, only for the numeric ones. I'll show you what I mean in just a second. Let's first just do the correlation and put it all in a DataFrame; we'll come back to y = mx + b in a minute.

So for each column, we pass in df.charges and df[col]. I can't just write df.col here with a variable name; it'll give me an error. Whenever I'm using a variable, I have to use the bracket index notation. So we calculate r and p, and then we say corr_df.loc. The .loc refers to the row index. What should we call the row index? Let's call it the name of the column we're looking at, and set it equal to r rounded to three decimals and p rounded to three decimals. When that's done, we print out corr_df.

Now we get an error. Why? It says ufunc 'add' did not contain a loop with signature matching types. In other words, one of the columns we're trying to calculate a correlation for is not numeric. Up above, the non-numeric columns were ignored automatically, but that's built into the corr function. We didn't use that function down here; we used stats.pearsonr. So we need to tell it to ignore anything that's not numeric. Let's say if pd.api.types.is_, I always forget, is_numeric_dtype. There we go. And what do we pass into it?
If we pass in col, that's going to give us a problem, because what is col? col is nothing more than a string. To see what I mean, let's print out col each time. Oh, I have to remember my colon. All right, it prints out the column name, but it never goes inside this if. That's because col is nothing more than a string: age, sex, bmi, just the name of the column. If I want this function to work, I need to pass in an entire column of data and not just the column header. How do I do that? That's where I refer to df and pass in col as an index, df[col], and that passes in the entire column. Now it actually goes into the if each time there's a numeric column.

Well, we don't need the correlation of charges with itself, so we could add something like: and col not equal to "charges". There we go, so it ignores charges this time. Let me get rid of my print here; I don't need it anymore. That was just to see what col was. And there's my nice little DataFrame. Cool. Well, let's keep going and learn y = mx + b.
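Putting the loop together, a minimal sketch might look like this. The DataFrame is a hypothetical stand-in with made-up rows (the real data comes from a CSV file in the walkthrough), and the column names follow the ones mentioned above.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in rows; the walkthrough uses a dataset loaded from CSV
df = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 31, 46, 37],
    "sex": ["female", "male", "male", "male", "male", "female", "female", "female"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 25.7, 33.4, 27.7],
    "children": [0, 1, 3, 0, 0, 0, 1, 3],
    "smoker": ["yes", "no", "no", "no", "no", "no", "no", "no"],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62, 8240.59, 7281.51],
})

# One row per feature, holding its correlation with the label and the p-value
corr_df = pd.DataFrame(columns=["r", "p"])

for col in df:
    # Pass the column itself (df[col]), not the column name, to the dtype check,
    # and skip the label so it is not correlated with itself
    if pd.api.types.is_numeric_dtype(df[col]) and col != "charges":
        r, p = stats.pearsonr(df["charges"], df[col])
        corr_df.loc[col] = [round(r, 3), round(p, 3)]

print(corr_df)
```

Passing `col` alone to `is_numeric_dtype` would check the name string rather than the column's data, which is why the condition never became true in the walkthrough until `df[col]` was used.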