Transcript:
What I want to look at here is, you know, what is it doing? What is the scatterplot showing you? What it is doing is looking for an association between two variables, so it’s a bit like last week, when we talked about how one variable changes with another. We’re doing the same with a correlation: how does this variable change with another variable? For example, how does population size relate to GDP, gross domestic product? Do countries with bigger populations have higher GDPs? Well, we might expect so, other things being equal, but of course that’s not exactly the case. Some countries are very rich and have a much higher GDP per person than others, but in general you might expect that the bigger the population, the bigger the gross domestic product of the country would be.
Crime and unemployment rates: we all know the idea that as unemployment rates rise, so does crime, you know, the frequency of crime. Actually, that hasn’t happened in recent years. Interestingly, in this country unemployment has gone up but the crime rate has been going down. So it’s not always the case, but we want to investigate that relationship: how does the rate of crime relate to unemployment in the country? And of course the most obvious one is height and weight. Someone’s height is related to their weight, or their weight is related to their height. The taller people are, the bigger they are, and therefore the heavier they tend to be, but it’s not an exact relationship. Obviously some people are very tall and very thin, and they will weigh less than very tall people who aren’t thin, and so on.
So what we’re doing here is looking at how the values of one of these variables relate to the values of the other variable. To do this kind of investigation, to use the scatterplot and to use the correlation statistic on it, both of the variables must be, as I said here, rank, interval or ratio, and this is the same.
If you’re using SPSS, Scale or Ordinal is the way that SPSS talks about it. So the important thing is that we can’t use correlations on the things we were using last week for the cross tabulations, the categorical variables. I’ve said here things like ethnicity, gender, town of birth, occupation and so on: you can’t use those in scatter plots and correlations. They have to be ordered at least, so whatever we’re talking about has to be in some kind of order and, ideally, on some kind of scale. Things that aren’t in any particular order, like town of birth for example, it doesn’t make any sense to correlate. And of course, remember what I said last week: SPSS will do it for you. If you put in those variables, it’ll generate a scatter plot and so on, but it won’t make any sense. So you have to know what you’re doing and remember that those kinds of categorical variables can’t be used in scatter plots and correlations. It’s got to be rank, interval or ratio.
Okay, so here’s an example of what a scatter plot looks like, and I suspect you’re familiar with this. You’ve seen this kind of thing many times in newspapers, on TV and so on. Let me just go over what it shows you. In this case, it’s age in years versus number of GCSEs. If you wonder where that came from, it’s a data set I’ve used many times before, from years ago. Up the left-hand side, this axis here is showing the number of GCSEs, so the people with the most GCSEs, I think it was 13 up here, or maybe that’s 14, that’s the 13, and so on, and the ones with none are down here, so that person had no GCSEs. That’s the up-and-down axis, what’s called the Y axis. Across the chart is age, so it goes from age 20 years, the lowest age there, across to somebody over 50 over there. So the chart has a dot or a cross on it for each person in the data set. I think from memory.
There were 34 people in this dataset, so there’d be 34 crosses or dots on the chart, and each of them indicates, for that person, their age and the number of GCSEs they got. In this case, there’s not a very strong relationship. You might say there’s the beginning of a trend, that the younger people got more and the older people, over this side, got fewer GCSEs, but it’s not a very strong relationship. It’s not a very clear picture at all. Fortunately, they’re not all like that.
Okay, so let me go through some interpretations of the scatterplot. This one’s a much nicer one. It’s a made-up one, with variable Y up and down and variable X across the base. In this case, you’ve got a series of points going from bottom left to top right, not exactly on the line. The red line is the best fit, what’s called the regression line. A bit of statistics gives you that best-fit line. Not every point is on that line, so we haven’t got a perfect relationship between the two variables. We know that some people are about the same on variable Y but have different values for variable X. So, for example, these two people here are the same on variable Y, but they obviously have different values on variable X. The other way around, here’s a couple of people, that one there and that one there: they are the same on variable X, but they have clearly different values on variable Y. They are different, but they’re not very different. So there is some relationship going on here: as variable X tends to increase, so does the value of variable Y, and the word we use for that is that it tends to correlate with it. So there’s a correlation between variable X and variable Y: not an enormously strong one, but not a weak one either. It’s a middling one, as we’ll see in just a minute.
Okay. Now, the important point about correlation, going back to this diagram.
What we’re saying is that one variable tends to be related to the other variable. So, going back to my previous one, age in years tends to be related to the number of GCSEs, but not very strongly here. When we say that, we are not claiming causation. That’s an important point to remember about correlations. When you find a correlation, whether it’s a weak one or a strong one doesn’t matter, we’re not saying it’s causation. We can’t infer from discovering a correlation, however strong it is, to causation, to saying one thing causes the other. So we might find that the older a child is, the better she is at reading. We know that, and we might know from other things that age does make a difference here. We have various theories that tell us that as you get older, you’ll get better at certain kinds of things. So there is a correlation and we know there’s a causal connection, but we can’t infer that from the correlation.
Likewise, we might find that the lower your income, the greater the risk of schizophrenia. I suppose there’s the possibility of a causal connection here. You might say that if you get schizophrenia, you’re less able to earn a salary and therefore your income is likely to be lower, so that might explain it. But from the correlation you can’t infer that. There has to be some other information somewhere that tells us that.
So just look at this one, height and weight. We know that there’s a very strong correlation between height and weight. We can see that, obviously, but we know that weight does not cause height. Whatever weight you are doesn’t cause you to be a certain height; that doesn’t happen. But maybe height makes some contribution to your weight? Well, yes, it probably does. The taller you are, the bigger your bones, etc.
So the heavier you’ll be, but there are other things as well, other causes of whatever weight you are. It might be body shape, your diet, your fitness level; all sorts of things contribute to it. So the fact that there’s a correlation doesn’t mean there’s a causal connection, or even that there’s a single causal connection. There might be several causes for that correlation. That’s the important thing to bear in mind for the interpretation of a correlation. You find a correlation, you think there’s a relationship between one variable and the other, but it doesn’t necessarily explain it all. There may be other things acting as well. So bear that in mind.
And sometimes you find a correlation and there really isn’t any relationship at all, at least not a simple relationship between the two variables. You can have good strong correlations but no cause. So here’s the last example: the number of ice creams sold correlates with the rate of drowning. Now, there’s no obvious direct linkage. It’s not that having an ice cream means that when you go swimming afterwards you suddenly drown, or something like that. Or even vice versa: how could the drowning be in any way related to buying ice creams? Because if you’ve drowned, that’s it, you’re dead. You won’t want ice creams. The point is that this correlation is what’s called a spurious correlation. It’s not that one thing causes the other. What’s going on here is that something else is causing both of those things to rise together, and we know that it’s warm weather. Basically, as the weather gets warmer, more people go swimming and therefore more people drown; there’s always a small proportion of people who have accidents. And of course, as the weather gets warmer, sales of ice cream increase. So this is what is called a spurious correlation. You need to watch out for this.
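The ice cream and drowning example can be illustrated with a small simulation. This is only a minimal Python sketch with entirely made-up numbers, not real data: the variable names, the coefficients, the number of days and the 19 to 21 degree band are all assumptions chosen for illustration, and Pearson’s r is assumed as the correlation statistic. Temperature drives both series, and neither series appears in the other’s formula.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical daily temperatures in degrees Celsius (made up).
temperature = [rng.uniform(5, 30) for _ in range(5000)]

# Both series depend on temperature plus their own independent noise.
# Units are arbitrary; these are illustrative indices, not real counts.
ice_creams = [20 * t + rng.gauss(0, 60) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

# Across all days, the two series rise and fall together.
r_all = pearson_r(ice_creams, drownings)

# Hold the weather roughly fixed by keeping only days in a narrow band.
mild = [i for i, t in enumerate(temperature) if 19 <= t <= 21]
r_mild = pearson_r([ice_creams[i] for i in mild],
                   [drownings[i] for i in mild])

print(f"all days:                 r = {r_all:.2f}")
print(f"days at 19 to 21 degrees: r = {r_mild:.2f}")
```

Across all the simulated days the correlation should come out clearly positive, while among days with nearly the same temperature it should be close to zero. That is the pattern you would expect if a third thing is driving both variables.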
You’ll find these kinds of correlations, but they don’t necessarily explain anything. There’s something else going on, which is actually causing both of them to rise together or to drop together. Okay, so that’s the general lesson about correlations: a correlation looks for relationships, but we can’t simply infer a causal relationship from it. Of course, with other information we can build up a model that might explain it, but from the correlation alone we can’t.
Okay, so let’s look at how to do a scatter plot, how to produce one of these charts that looks at the relationship between these two variables. I think I’m going to try the first one. So, eight minutes long.
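For anyone who wants to see what the correlation statistic and the best-fit line described earlier actually compute, here is a minimal Python sketch, separate from the SPSS steps. It assumes Pearson’s r for the correlation and an ordinary least-squares line for the regression line; the session does not name either method, so treat both as assumptions. The `x` and `y` values and the function names are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


def best_fit_line(xs, ys):
    """Least-squares slope and intercept of the line predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / ss_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values for two scale variables: the points run from bottom left
# to top right, but they do not all sit on one straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

r = pearson_r(x, y)
slope, intercept = best_fit_line(x, y)

print(f"r = {r:.3f}")
print(f"best-fit line: y = {intercept:.3f} + {slope:.3f} * x")
```

A value of r at +1 or -1 would mean every point sits exactly on the line; the further the points scatter away from it, the closer r gets to zero.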