Transcript:

Hi! I am Mike Marin and in this video we’ll introduce “simple linear regression” using R. Simple linear regression is useful for examining or modelling the relationship between two numeric variables; in fact, we can also fit a simple linear regression using a categorical explanatory or X variable, but we’ll save that topic for a later video. We will be working with the lung capacity data that was introduced earlier in this series of videos. I have already gone ahead and imported the data into R and attached it. We will model the relationship between Age and Lung Capacity, with Lung Capacity being our outcome, dependent, or Y variable. We can begin by producing a scatter plot of the data, plotting Age on the x-axis and Lung Capacity on the y-axis, and we’ll add a title here. We may also want to go ahead and calculate Pearson’s correlation between Lung Capacity and Age. We can see that there’s a positive, fairly linear association between Age and Lung Capacity. We can fit a linear regression in R using the “lm” command. To access the Help menu you can type “help” and in brackets the name of the command, or simply place a question mark (?) in front of the command name. Let’s go ahead and fit a linear regression to this data and save it in the object: mod. To do so we’ll fit a linear model predicting Lung Capacity using the variable Age; it’s important to note here that the first variable we enter in the model formula should be our Y variable and the second variable the X variable. We can then ask for a summary of this model. Here we can see that we are returned a summary for the residuals or errors. We can see the estimate of the intercept, its standard error, as well as the test statistic and p-value for a hypothesis test that the intercept is zero. It’s worth noting that a test of whether the intercept is 0 is often not of interest. We can also see the estimate of the slope for Age, its standard error, and the test statistic and p-value for the hypothesis test that the slope equals 0.
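
The steps so far can be sketched in R roughly as follows. This is a minimal sketch, not the exact code typed in the video: the lung capacity data file is not part of this transcript, so the block simulates stand-in data. The variable names `Age` and `LungCap`, the sample size, and the simulation parameters are all assumptions made for illustration, so the numbers it prints will not match those quoted in the video.

```r
# ASSUMPTION: simulated stand-in data, NOT the real lung capacity data set.
# The names Age and LungCap and all simulation settings are illustrative only.
set.seed(1)
n <- 200
Age <- sample(3:19, n, replace = TRUE)
LungCap <- 1 + 0.55 * Age + rnorm(n, sd = 1.5)

# Scatter plot with Age on the x-axis, Lung Capacity on the y-axis, and a title
plot(Age, LungCap, main = "Scatterplot",
     xlab = "Age", ylab = "Lung Capacity")

# Pearson's correlation (the default method used by cor)
r <- cor(Age, LungCap)
print(r)

# Help pages for lm can be opened with help(lm) or ?lm

# Fit the model: the Y variable goes to the left of ~, the X variable to the right
mod <- lm(LungCap ~ Age)

# Residual summary, coefficient table, residual standard error, R-squared, F-test
mod_summary <- summary(mod)
print(mod_summary)
```

With a single numeric X variable, the R-squared reported in the summary is simply the square of Pearson's correlation, which is a handy check that the plot, the correlation, and the model all describe the same relationship.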
You’ll also notice that stars are used to identify significant coefficients. Here we can see the residual standard error of 1.526, which is a measure of the variation of observations around the regression line. This is the same as the square root of the mean squared error or Root-MSE. We can also see the R-squared and the adjusted R-squared, as well as the F-test and p-value for a test that all the slope coefficients in the model are zero (here, just the slope for Age). Recall in earlier videos we saw the “attributes” command. Here we can ask for the attributes for our model, and this will let us know which particular attributes are stored in this object mod. We can extract certain attributes using the dollar sign ($); for example, we may want to pull out the coefficients from our model. It’s worth noting that we’ll only need to type “coef” here and R will know that these are the coefficients we’re asking for. We may also extract certain attributes in the following way: here we’ll use the “coef” command to ask for the coefficients of our model. Now let’s go ahead and produce that plot we had earlier. If we would like to add the regression line to this plot we can do so using the “abline” command. Here we would like to add the line for our regression model; and as we’ve seen earlier we can add colours to this line as well as change the line width using these commands. It’s worth noting that we will need to do something slightly different to add regression lines for multiple linear regressions with multiple variables. We’ve already seen the “coef” command to get our model coefficients. We can produce confidence intervals for these coefficients using the “confint” command. Here we would like the confidence intervals for the model coefficients. If you would like to change the level of confidence for these, we can do so using the “level” argument within the “confint” command. Here let’s go ahead and have ninety-nine percent (99%) confidence intervals.
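
A sketch of these extraction and plotting steps is given below. As before, the data are simulated stand-ins (an assumption, since the real lung capacity data are not included here), and the colour and line width chosen for the regression line are illustrative rather than the values used in the video.

```r
# ASSUMPTION: simulated stand-in data, NOT the real lung capacity data set.
set.seed(1)
n <- 200
Age <- sample(3:19, n, replace = TRUE)
LungCap <- 1 + 0.55 * Age + rnorm(n, sd = 1.5)
mod <- lm(LungCap ~ Age)

# List what is stored in the fitted model object
print(attributes(mod))

# Extract the coefficients with $; "coef" is enough because $ partially
# matches the full component name "coefficients"
print(mod$coef)

# The same coefficients via the coef() extractor function
print(coef(mod))

# Redraw the scatter plot and add the fitted regression line
plot(Age, LungCap, main = "Scatterplot",
     xlab = "Age", ylab = "Lung Capacity")
abline(mod, col = 2, lwd = 3)  # illustrative colour (2 = red) and line width

# 95% confidence intervals (the default), then 99% intervals via level
ci95 <- confint(mod)
ci99 <- confint(mod, level = 0.99)
print(ci95)
print(ci99)
```

The 99% intervals come out wider than the 95% ones, as expected when asking for more confidence from the same data.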
You’ll recall that we can ask for a summary of the model using the “summary” command. We can also produce the ANOVA table for the linear regression model using the “anova” command. Here we would like the ANOVA table for this model. You’ll note that this ANOVA table corresponds to the F-test presented in the last row of the linear regression summary. One final thing to note is that the residual standard error of 1.526 presented in the linear regression summary is the same as the square root of the mean squared error or mean squared residual from the ANOVA table. We can see that if we take the square root of 2.3, we get the same value as the residual standard error; the slight difference is due to rounding error. In the next video in this series we’ll discuss how to produce some regression diagnostic plots to examine the regression assumptions: these include residual plots and QQ plots, among a few others. Thanks for watching this video, and make sure to check out my other instructional videos.
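
The link between the ANOVA table and the regression summary can be checked directly, as in the sketch below. The data are again simulated stand-ins (an assumption), so the values will differ from the 1.526 and 2.3 quoted above, but the relationship holds for any fitted simple linear regression.

```r
# ASSUMPTION: simulated stand-in data, NOT the real lung capacity data set.
set.seed(1)
n <- 200
Age <- sample(3:19, n, replace = TRUE)
LungCap <- 1 + 0.55 * Age + rnorm(n, sd = 1.5)
mod <- lm(LungCap ~ Age)

# ANOVA table for the fitted model
aov_tab <- anova(mod)
print(aov_tab)

# Mean squared error (mean squared residual) from the ANOVA table
mse <- aov_tab["Residuals", "Mean Sq"]

# Residual standard error as reported by summary()
rse <- summary(mod)$sigma

# The square root of the MSE matches the residual standard error
print(c(sqrt_mse = sqrt(mse), residual_se = rse))

# The F statistic in the ANOVA table matches the one in the summary
f_anova   <- aov_tab["Age", "F value"]
f_summary <- unname(summary(mod)$fstatistic["value"])
print(c(f_anova = f_anova, f_summary = f_summary))
```

Working from the unrounded values stored in the objects, rather than the rounded numbers printed on screen, avoids the small rounding discrepancy mentioned above.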