Transcript:

Earlier, we discussed testing for a correlation, which is a good way to see if a relationship exists between two continuous variables, X and Y. In our example, we were able to test whether the fuel efficiency of a car was related to its weight. But what if you want to turn this relationship into a prediction? For instance, what would be the fuel efficiency of a 4,500 pound car?

We can do this by fitting a linear model, or linear regression, which is done in R with the lm function. So let's save this linear fit to a variable by doing fit = lm(mpg ~ wt, data = mtcars). That's lm for linear model. Inside the parentheses I put the formula we're testing, mpg ~ wt, which is miles per gallon explained by weight. Remember that the tilde means "explained by". And the data we're working with is mtcars. This saves the fit.

Now we can look at the details of it with the summary function: summary(fit). This output contains a ton of information, even more than the t-test or the correlation test. Let's look at a few parts of it briefly. The first part shows the call. That's the way the function was called: miles per gallon explained by weight, using the mtcars data. The next part summarizes the residuals. That's how much the model got each of its predictions wrong: how different the predictions were from the actual results.

This table, the most interesting part, is the coefficients. It shows our actual predictors and the significance of each of them. First we have our estimate of the y-intercept. This shows what the hypothetical miles per gallon would be of a car that weighed zero in this linear model. We also see wt. That's the effect of the weight, or the coefficient of the weight, or the slope. It shows us a negative relationship: increasing weight decreases miles per gallon. In particular, increasing the weight by a thousand pounds would decrease the efficiency by about 5.3 miles per gallon. The second column is the standard error, which we won't examine in detail today.
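As a rough sketch of the fitting step described above, the following R code fits the model and prints its summary. The name fit comes from the lesson; fit_summary and residual_check are names chosen here purely for illustration, and the residual check is an added sanity test, not something shown in the video.

```r
# mtcars ships with base R; wt is recorded in thousands of pounds.
fit <- lm(mpg ~ wt, data = mtcars)   # mpg "explained by" wt

# The summary shows the call, the residuals, and the coefficient table.
fit_summary <- summary(fit)
print(fit_summary)

# Residuals are the actual values minus the model's predictions.
# (residual_check is an illustrative name, not part of the lesson.)
residual_check <- mtcars$mpg - unname(fitted(fit))
print(all.equal(unname(residuals(fit)), residual_check))
```

The slope reported for wt should be negative, matching the statement that heavier cars get fewer miles per gallon.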
In short, the standard error represents the amount of uncertainty in the estimate. The third column is called the test statistic, a mathematically relevant value that was used to compute the last column, which is the p-value, describing whether this relationship could be due to chance. You might notice that the p-value for weight, 1.29 times 10 to the minus 10, is exactly the same as it was for our earlier correlation test. That's because we're testing the same trend.

We can extract this matrix of coefficients using the coef function. So that would be coef(summary(fit)). From that we get a matrix. If we want to extract just the estimates, just the y-intercept and the coefficient for weight, we would get the first column of this matrix and save it to a vector. So I save the matrix with co = coef(summary(fit)) and get the first column with co[, 1]. This is how we get the first column of a matrix: the estimates of the y-intercept and of weight. If we want the p-values, we would get the fourth column, co[, 4]: the p-value for whether the y-intercept is different from zero, and the p-value for the weight relationship.

The advantage of a linear model is that it can be used not only for statistical testing but also for prediction. This model predicts a gas mileage for each of our existing cars using the predict function: predict(fit). You'll notice that for each of the cars we have, we get one prediction of the miles per gallon based on this linear fit. Now, these predictions aren't really that useful to us, because we already have the actual gas mileage of each of these cars. But what if we want to predict the gas mileage of a car that has a weight of, say, 4,500 pounds? We could do this by adding together the intercept term and the coefficient estimate times the weight.
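The extraction steps above can be sketched in R roughly as follows. The names fit and co follow the lesson; estimates, p_values, cor_p, and fitted_mpg are illustrative names introduced here, and the cor.test call is an assumption about how the earlier correlation test was run.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient table as a matrix: one row per term, four columns
# (Estimate, Std. Error, t value, Pr(>|t|)).
co <- coef(summary(fit))

estimates <- co[, 1]   # y-intercept and slope for wt
p_values  <- co[, 4]   # p-values for the intercept and for wt

# Assumed form of the earlier correlation test; its p-value should
# match the p-value for wt because both test the same trend.
cor_p <- cor.test(mtcars$wt, mtcars$mpg)$p.value
print(c(slope_p = unname(p_values["wt"]), cor_p = cor_p))

# With no new data, predict() returns one prediction per existing car.
fitted_mpg <- predict(fit)
print(head(fitted_mpg))
```

Indexing by column name, for example co[, "Estimate"], gives the same result as co[, 1] and may be easier to read.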
So if summary(fit) looks like this, we can add together the intercept term, 37.2851, plus the weight coefficient, negative 5.3445, times our new weight, which is 4.5, in thousands of pounds. This would predict a fuel efficiency of about 13.2 miles per gallon. This is what a linear model actually means: the prediction is a linear combination of the intercept and the slope, that is, the intercept plus the slope times the weight.

Now there's a shortcut for producing this value from the fit, using the predict function. First, we create a data frame containing the predictors we wish to use. In this case, imagine we had a new car: newcar = data.frame(wt = 4.5). Now that we've created this data frame, we do predict on both our fit and our new car: predict(fit, newcar). This calculates the same estimate, 13.235, predicting this car's miles per gallon using this fit.

Finally, note that we can show a linear model on our plot using a method built into ggplot2, geom_smooth. Recall our earlier plot: a ggplot of mtcars where weight was on the x-axis and miles per gallon was on the y-axis. We said it's a scatterplot with geom_point. Now let's add to it the layer geom_smooth, and tell it the method we wish to use is a linear model, the same one we've been learning. Now we get a linear trend on our ggplot. The gray area shown is the uncertainty in the fit. It's a 95% confidence interval of where the true trend line could be.

It's worth noting that this is not a perfect linear fit. We can see that values at both the low end and the high end have a tendency to be higher than we would predict. Dealing with these issues is beyond the scope of this lesson.
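A minimal sketch of the prediction and plotting steps, assuming the same mtcars fit as in the lesson. The names fit and newcar follow the lesson; manual, shortcut, and trend_plot are illustrative names, and the aes mapping and the check for an installed ggplot2 are assumptions added here so the snippet runs on its own.

```r
fit <- lm(mpg ~ wt, data = mtcars)
b <- coef(fit)

# By hand: intercept + slope * weight (weight in thousands of pounds).
manual <- b[["(Intercept)"]] + b[["wt"]] * 4.5

# Shortcut: a data frame of new predictor values passed to predict().
newcar <- data.frame(wt = 4.5)
shortcut <- predict(fit, newcar)

print(c(manual = manual, shortcut = unname(shortcut)))

# Plot the scatterplot with a linear trend; skipped if ggplot2 is absent.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  trend_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +                               # the scatterplot layer
    geom_smooth(method = "lm", formula = y ~ x)  # line plus 95% band
  print(trend_plot)
}
```

Both approaches should print the same predicted miles per gallon for the 4,500 pound car.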