Transcript:

Hello. So far, we’ve considered general linear models in which a continuous response is assumed to be normally distributed around a given model and, furthermore, we assume that the variance of our response around that model is the same for all levels of the predictor variables in that model. However, in some cases, our response variable cannot be considered continuous but takes a different form. One example is a binary form, in which the response can be either yes or no, a 1 or a 0, drunk or sober, male or female, diseased or healthy, etc. How do we analyze data where we can’t assume a normal distribution of our response? Well, the key method for understanding and testing hypotheses relating to binary response variables is the so-called binary logistic regression.
Here is an example of the use of a binary logistic regression. Here we have trees of different widths and a statement of whether each individual tree is infested with a parasite or not. Now, eyeballing these data, it seems that larger trees tend not to be infested compared to smaller trees. How do we test hypotheses based on the sample data? Well, one approach is simply to apply a t-test or analysis of variance to look at the difference between infested and non-infested trees in terms of their widths: does tree width differ significantly between infested and non-infested trees? However, that form of analysis only takes you so far. In many cases, when we apply methods to understand a response variable that is binary, we also want to understand under what conditions it will switch from one state to another. So, for example, under what conditions will a tree be much more likely to be not infested than infested? Now, the key method for understanding the conditions under which that switch might occur is the binary logistic regression.
In binary logistic regression, what we do is assume that the response variable we are measuring is the outcome of a probabilistic model that depends on the various predictor variables, in this case the tree width. So, in effect, what we are doing is fitting some form of sigmoidal model to our data, which is the probabilistic model that generates those data. What sort of line has the properties that allow us to include some form of switch? Well, clearly it has to be sigmoidal, in the form of an S-shape, and clearly it should be able to go both ways: not just from left to right here, but also from right to left in this way. Well, there are a variety of mathematical formulae that will give us sigmoidal curves, but the one that is used in binary logistic regression, because it relates so well to the original form of statistics that we’ve been looking at, the general linear model, is this form. Now, it could look a little bit scary, but I want to take you through it because it’s really very simple. It’s an exponential function where we have an exponential on the numerator and one plus the same exponential on the denominator, and here is the probability of being in one state or another. What you’ll see is that we’ve got coefficients: one which is really very similar to an intercept and, in this case, a gradient in front of the first predictor variable x1, and we can generalize it and have x2 all the way up to xk. So, in effect, this would look like a multiple regression type model, but it all forms part of this exponent here. I should also note that binary logistic models will allow us to include categorical predictors, just in the same way as we can introduce categorical predictors into general linear models.
Now, let’s have a look at what really happens and remind ourselves what goes on in the logistic regression model fit. Here we have a response variable.
This is whether an individual is dead or alive, and this is the concentration of the toxin to which it’s been exposed. It really does look like, when we expose individuals to high concentrations of the toxin, they’re more likely on average to be dead than alive, although we have one or two observations here which happen still to be alive. What happens in the logistic regression model is that we actually fit the probabilistic model that is most likely to have generated these individual data points. So here we’ve got more likely to be dead at these concentrations, and then there’s a switch to more likely to be alive at those concentrations. So logistic regression tests for relationships between a binary response variable (yes or no, healthy or diseased, etc.) and predictors by estimating the coefficients of these simple equations here, the exponential-type models. And what are those equations doing? All the equation is telling us is the probability of being in one outcome or another.
How does it work? Well, here is the effective model that is fitted to our data, and we can see that it’s a rather difficult exponential form. There is a way to simplify it. The first step in simplification is to think about the probability of being in one state divided by the probability of being in the other state. That’s our first step, and that’s called the odds: the probability of yes divided by the probability of no. If we take the log of that, it turns out that we have effectively manipulated this to form a simple linear equation. So, by doing some mathematical trickery, we can turn this exponential model into a far simpler linear model based not on p but on the log of p divided by 1 minus p.
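
The equations on the slides are not captured in this transcript, so the following is a reconstruction of the standard form from the spoken description, not a copy of the slide. The symbols are assumptions: p is the probability of a "yes" outcome, β0 is the intercept, and β1 to βk are the gradients on the predictors x1 to xk.

```latex
\begin{align*}
p &= \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}
          {1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}} \\[4pt]
1 - p &= \frac{1}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}} \\[4pt]
\frac{p}{1 - p} &= e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k} \\[4pt]
\log\!\left(\frac{p}{1 - p}\right) &= \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\end{align*}
```

Dividing the first line by the second cancels the shared denominator, which is why the log of p over 1 minus p ends up as a plain linear expression in the predictors.
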
So, as I said, that ratio is the odds, and taking logs of the odds to create a linear relationship is a transformation called the logit transformation. So often you will see binary data associated with the term logit transformation, and what we’re doing here is turning the relationship that explains the binary outcomes into a linear relationship by doing this mathematical transformation. Now, logistic regression is called a generalized linear model. It’s not a general linear model, but it can be turned into a format in which we can analyze it as if it were a general linear model by an appropriate transformation and, as I’ve said, it will work for both categorical and continuous predictors in our model. So, by some mathematical trickery, we’ve actually turned our model into a much more tractable form, so it is generalizable back into the classic general linear model by applying these transformations.
How do we carry out a logistic regression? What actually happens? Well, the first thing is that it tests null hypotheses. Of course, we have sample data and we want to make inferences about the population as a whole. So what we do here is test whether the coefficients, and that can be the effective intercept, beta 0, or, more interestingly and more importantly, any of those gradients like beta 1, are different from zero at the population level. What it uses is a Wald statistic, which is called z in R. Now, the Wald statistic is closely related to other well-known test statistics, in particular the t statistic, and in its squared form it’s compared with a chi-squared: if the null hypothesis of those population-level parameters being zero is true, then the squared Wald statistic will follow a chi-squared with a certain number of degrees of freedom (one for a single coefficient). So that’s the theory.
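
As a rough illustration of the Wald test just described, here is a minimal Python sketch. It is not the R output from the lecture. It assumes the usual form of the statistic, z = estimate / standard error, compares z with a standard normal distribution, and also reports z squared, which is the chi-squared form with one degree of freedom. The function name `wald_test` and the numbers passed to it are hypothetical.

```python
import math


def wald_test(estimate, std_error):
    """Wald test of H0: coefficient = 0.

    Returns (z, chi_squared, p_value), where p_value is two-sided.
    """
    z = estimate / std_error
    # Two-sided p-value from the standard normal:
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # z squared is the chi-squared form (1 degree of freedom);
    # it gives the same p-value as the two-sided z test.
    chi_squared = z * z
    return z, chi_squared, p_value


# Hypothetical estimate and standard error, for illustration only
z, chi_squared, p_value = wald_test(-0.5, 0.2)
print(f"z = {z:.2f}, chi-squared = {chi_squared:.2f}, p = {p_value:.4f}")
```

A small p-value here would lead us to reject the null hypothesis that the coefficient is zero at the population level.
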
Let’s have a look at it in practice. These are data from Gary Polis, which were reported in a really nice book by Quinn and Keough called Experimental Design and Data Analysis for Biologists. The data refer to the presence or absence of lizards on islands of different perimeter-to-area ratios. First of all, I’m going to read the data in. I’ll call it lizard data, and I’m reading it from a text file called Polis data, and then I’m going to view the data. Here we can see the perimeter-to-area ratio of the islands, and here’s the presence or absence of lizards on those islands. Next, I’m going to attach these data so that I can refer directly to the variables within the data frame. And, of course, the first step in any statistical analysis should be to visualize your data, so here I’m plotting a graph of the perimeter-to-area ratio against the presence or absence of lizards. Eyeballing these data, it really does appear that islands with small perimeter-to-area ratios tend to have lizards, and yet those with relatively large perimeter-to-area ratios do not tend to have lizards of this type.
How do we conduct a binary logistic model? Well, because it’s a generalized linear model, we call up the function glm, and here we’re relating the presence and absence of the lizards to the perimeter-to-area ratio of the island. Now, we’ve got a binary response. The generalized linear model allows us to deal with non-normal responses, but we have to tell it that we’re dealing with a binomial response, so we’ll say here family equals binomial. Now, R, knowing that the response variable is binomial, will automatically default to the logit link transformation, which turns this into a more tractable form, but it doesn’t do any harm to state explicitly that we’ve got the logit link. Now we have fitted the model.
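
The lecture fits the model with R’s glm and a binomial family; the exact commands are on screen and not in the transcript. To show what such a fit is doing underneath, here is a minimal pure-Python sketch of maximum-likelihood estimation for one predictor by Newton-Raphson. It is not the lecture’s code and not R’s implementation. The function names and the data are made up for illustration and are not the lizard data.

```python
import math


def score_and_information(b0, b1, x, y):
    """Gradient of the log-likelihood and the 2x2 information matrix entries."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # fitted probability
        w = p * (1.0 - p)                             # binomial variance weight
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11


def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = b0 + b1 * x; returns (b0, b1, se_b0, se_b1).

    Uses a fixed number of Newton steps with no convergence check,
    and assumes the data are not perfectly separable.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = score_and_information(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information matrix) * step = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard errors from the inverse information matrix at the estimates
    _, _, h00, h01, h11 = score_and_information(b0, b1, x, y)
    det = h00 * h11 - h01 * h01
    return b0, b1, math.sqrt(h11 / det), math.sqrt(h00 / det)


# Hypothetical data: presence (1) or absence (0) against a predictor
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

b0, b1, se0, se1 = fit_logistic(x, y)
print(f"intercept = {b0:.3f} (SE {se0:.3f}), gradient = {b1:.3f} (SE {se1:.3f})")
print(f"Wald z for the gradient = {b1 / se1:.3f}")
```

Dividing each estimate by its standard error gives the z value that a model summary reports for that coefficient.
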
We’ve got a summary of the model, and here are the key parameters. We’ve got estimates of the equation that has been fitted, the probabilistic model that’s been fitted to explain those binary responses, and here are the two key tests of the null hypothesis. We can reject the null hypothesis that the gradient is zero because the probability of obtaining that z value, or one more extreme, if the null hypothesis is true is really very small. We can also reject the null hypothesis that the intercept is zero, but that’s of less interest because that simply describes the baseline probability of being in this state or that state, of lizards being present or absent. So we reject the null hypotheses that the parameters in the model equation are zero at the population level.
Now, it may be of interest to plot back onto those data the overall probabilistic model that is thought most likely to have generated them. So here what we use is xv, which is a sequence of values of x going from the smallest to the largest in very small steps, and yv is the predicted value of the model for all those given values of xv. And we’re saying here that the type is the response, in that it’s not the linearized form in which we’ve got the logit model; we’re actually plotting it straight back onto the y values, the observed yes or no responses. And here we’ve got lines, which simply draws a line between all these x and y values, and I’m detaching the data for good measure. But what do these graphics commands do? Well, they draw this graph here. This is the fitted model for our data.
Now, just to make sure that you understand exactly what’s going on, recall that these are the coefficients of the fitted model here, in terms of the intercept and the gradient. We can now look at the probability of the lizards being present and how it’s been described by this fitted model. Here we’ve got the intercept.
There’s the estimate, 3.606, and here we have the gradient of the relationship between the perimeter-to-area ratio and the probability of presence, albeit through this more complicated exponential function.
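
To make the link between the coefficients and the plotted curve concrete, here is a minimal Python sketch of the back-transformation from the linear (logit) scale to a probability. The intercept 3.606 is the estimate quoted in the lecture. The gradient is not stated in the transcript, so `SLOPE` below is a hypothetical placeholder, and the range of `xv` is also an assumption; both should be replaced with the values from the actual model summary and data.

```python
import math

INTERCEPT = 3.606  # intercept estimate quoted in the lecture
SLOPE = -0.2       # HYPOTHETICAL placeholder: the gradient is not given in the transcript


def predicted_probability(x, intercept=INTERCEPT, slope=SLOPE):
    """Probability of presence at predictor value x under the logistic model."""
    eta = intercept + slope * x  # linear predictor on the logit scale
    # Algebraically the same as exp(eta) / (1 + exp(eta))
    return 1.0 / (1.0 + math.exp(-eta))


# xv: predictor values from smallest to largest in small steps (assumed range 0 to 60)
xv = [i * 0.5 for i in range(121)]
# yv: predicted probabilities on the response scale, ready to draw as a line
yv = [predicted_probability(x) for x in xv]

for x in (0, 10, 20, 40):
    print(f"ratio = {x:>2}: P(present) = {predicted_probability(x):.3f}")

# The switch point, where presence and absence are equally likely (p = 0.5)
print(f"p = 0.5 at ratio = {-INTERCEPT / SLOPE:.2f}")
```

With a negative gradient the curve falls from near 1 towards 0 as the perimeter-to-area ratio increases, and the ratio at which the probability crosses 0.5 is minus the intercept divided by the gradient.
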