Transcript:

The t-test was developed by William Sealy Gosset. He worked at the Guinness Brewery over a hundred years ago, and he developed this test to determine things like the difference between barley yields. Now, he wanted to publish this statistical test to share with other statisticians, but the brewery was nervous. They didn’t want him to publish. They didn’t want him to give away any secrets. He finally convinced them, but he had to publish under the pseudonym “Student.” So instead of this being known as Gosset’s t-test, it’s known as Student’s t-test.

So in this video, I’m going to start by conceptually showing you what the t-test is. I’ll then show you how to calculate the T value and run a t-test, and then finally how to do a t-test in just a few seconds using a spreadsheet.

So imagine I have two fields of barley, field one and field two, and I want to compare them, but I don’t want to cut down the whole field. I just want to take some samples: samples from one, samples from two. So I could get a sample from field one. Now, it won’t be a perfect normal distribution like this. It’s going to be more of a histogram that looks like that. And then I’m going to get a sample from field two. Now, which one of these has a higher yield? Well, we could figure out the mean, the average of each of those samples, and it looks like the average in field two is higher than the average in field one. But that’s only part of the picture. The mean only tells us so much, because we could have different distributions, and depending on that distribution, or the variance within that sample, there could be a statistically significant difference between the two or not.

And that’s where the T value comes in handy. It’s really a ratio of signal to noise. Signal is going to be numbers that tell me the difference between these two samples, and noise is going to be numbers that kind of get in the way. So how do I figure out the signal?
Well, the easiest way to do that is simply to find the difference between the two means. So if I calculate the mean in sample one and in sample two, we’ll call those X bar one and X bar two, and the absolute value of the difference between the two is going to tell you how much signal there is, how much difference there is.

Now, how do we get at the noise? That’s going to be in the variability of the groups themselves, and so the factor is going to look something like this: the square root of s1 squared over n1 plus s2 squared over n2. What is s1? That’s the standard deviation. Remember, that’s how far our data is spread from the mean. But we’re not only using the standard deviation, we’re actually squaring it, and that gives us something called the variance. So if I increase the variance, that’s going to lower my T value. It’s like giving me more noise. The other factor in here is going to be the number of samples that I’m taking. As I increase the number of samples, that will actually reduce the noise and raise the T value, up to a point. So again, a bigger difference between the means is going to give us more signal and a higher T value, and increasing the variability is actually going to decrease it.

So let me show you how to calculate that T value. I’m using Excel, but you could use Google Sheets or even your TI calculator. So if you look at these two samples from field 1 and field 2, can you tell which one has a higher yield? It’s really hard just looking at it. Is there a difference between the two? And how much is that difference? I will use the T value to calculate that.

First thing we have to figure out is the mean. In a spreadsheet, you type equals AVERAGE instead of mean. I’m going to put a left parenthesis, and now I’m going to select that entire sample set from field one and close the parenthesis, and I get a mean of 15.38. Now I can select that and drag over, and I get a mean of 15.68 in field two. Next thing I have to figure out is my standard deviation. So that’s equals STDEV, left parenthesis.
I’m now going to select that field one sample, and now I close the parenthesis, so we have a standard deviation of 0.3124. I’m now going to apply that to field two’s data set. So we have a higher standard deviation there, remember. I now have to calculate the variance. To do that, you square the standard deviation. So I’m going to select that cell and raise it to the second power. So there’s my variance for field one, and here’s my variance for field two. And then finally, I have to know how much data I’m actually collecting. If you type equals COUNT, that’ll count the number of data points, so I’m going to count those and we get 16, and it’s going to be 16 in the next one as well.

Now, you could use a spreadsheet to calculate this, or you could do it by hand. It takes a long time to figure out standard deviation by hand, so I encourage you to use something like a spreadsheet. Now that I have all these values, I’m simply going to plug them into my T value like that. We’ve got the signal on the top, so I’m going to find the difference between these two means, and then I’m going to figure out my noise on the bottom. Remember, you have to divide each variance by its sample size, add them, and then take the square root of that. So if I do the work for you, we’ve got a signal of 0.30 and a noise of 0.13, so I’ve got a T value of 2.3. What does that mean? Since it’s higher than 1, that means there’s more signal than there is noise.

I’m going to put that over here to the side, because this video is not about Student’s T value. It’s about Student’s t-test. So now we’re going to run a t-test. What are we testing? We’re testing our null hypothesis, just like we do in a chi-square test. What we’re going to start with is a null hypothesis that says there’s no statistically significant difference between the samples. In other words, any difference that we would find is
simply due to chance. You then identify a critical value, a number. If our T value is lower than that, then we don’t reject our null hypothesis, but if we get a T value that’s higher than the critical value, then we reject our null hypothesis. There must be an alternate hypothesis. There could be something going on between these two fields.

Now, how do we find that critical value? We’ll use a t-table that looks like this. It looks confusing, but it’s really not that bad. This would be for a two-tailed test, and I’ll show you what that means in just a second. First thing you have to know is what probability we’re going to use. Generally in science, we’ll use the 0.05 probability, so that’s going to be this column right here. What does that mean? Well, this is an inferential statistic. It means that if there were really no difference between the fields and we were to do this sampling a hundred times, about 95 of those times the T value would stay below the critical value, and only about 5 times would chance alone push it above. So it has a lot to do with chance. I’m going to use that 0.05.

Now we have to figure out what row we’re going to use, and to do that we have to know how many samples we collected and figure out the degrees of freedom. The degrees of freedom is going to be the sample sizes n1 plus n2, minus 2. Since we took 16 from each, it’s going to be 32 minus 2, or 30 degrees of freedom. So here’s our critical value. Our critical value is going to be 2.04.

So is our T value higher than that? This is where we’re doing the actual t-test. Are we higher than 2.04? We are. And so what does that mean? We’re going to reject our null hypothesis. That means there is a statistically significant difference between these two sample sets. Now, it’s not much higher than that. Remember, it’s just 2.3, and if we were to look over here at the 0.025 probability, we can see that we’re actually lower than that critical value. So we’re not positive, but we’re pretty sure that there’s something statistically significant between these two.
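
The manual steps above can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet from the video: the function names `t_value` and `degrees_of_freedom` are made up for this sketch, and because the raw barley data is not in the transcript, the worked numbers are the rounded ones quoted in the video (signal 0.30, noise 0.13, 16 samples per field, critical value 2.04).

```python
import math
from statistics import mean, stdev


def t_value(sample1, sample2):
    """Signal over noise: |mean1 - mean2| / sqrt(s1^2/n1 + s2^2/n2).

    stdev() is the sample standard deviation, like STDEV in a spreadsheet.
    """
    n1, n2 = len(sample1), len(sample2)
    signal = abs(mean(sample1) - mean(sample2))
    noise = math.sqrt(stdev(sample1) ** 2 / n1 + stdev(sample2) ** 2 / n2)
    return signal / noise


def degrees_of_freedom(n1, n2):
    """Degrees of freedom for two independent samples: n1 + n2 - 2."""
    return n1 + n2 - 2


# Rounded values quoted in the video (the raw data is not available here).
signal, noise = 0.30, 0.13
t = signal / noise                    # roughly 2.3
df = degrees_of_freedom(16, 16)       # 32 - 2 = 30
critical = 2.04                       # t-table, two-tailed, 0.05 column, 30 df

reject_null = t > critical
print(round(t, 1), df, reject_null)   # prints: 2.3 30 True
```

With your own data you would call `t_value(field1, field2)` on two lists of measurements instead of plugging in the rounded signal and noise.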
Now that was a lot of work. We had to calculate our variances, our means, and our sample sizes, and then find this table. The nice thing about a spreadsheet is it can calculate a t-test very quickly. So what I’m going to do is put “t-test” here as a label, and then I’ll just write in this next cell equals TTEST. There are four things I have to put in for a t-test. The first one is going to be my sample set, we’ll say from field 1, so I’m going to select that. Then I put a comma in. Now I’m going to grab my data set from field 2, and then I’m going to put another comma. The last two tell it that we’re doing a two-tailed test and that this is an independent test. I’ll show you what that is in just a second. But you can see, in just a few seconds we’ve calculated my probability, or my p-value. What is it? It is 0.026. What does that mean? It’s slightly above 0.025, so somewhere in between 0.05 and 0.025. In just a few seconds we’re able to realize that we need to reject that null hypothesis. So it’s really simple in a spreadsheet to do a t-test very quickly.

Now, we did an independent t-test, or an unpaired-sample test. What does that mean? We had two different fields that we were comparing, so you could be comparing, for example, two different populations. You can also run a paired t-test, and you would have to select that when you’re running the t-test. What would that be? It’s when we’re sampling the same population twice. So maybe we’re looking at field two, but then we’re applying a chemical and looking at it again. That would be a paired test.

We’re also doing a two-tailed test. When we’re figuring out that probability of 0.05, you can think of it like this. This is the 0.95 where we would not reject the null hypothesis, and that 0.05 is actually split between the two tails, because we’re not sure which direction that difference is going to be.
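
To make the two-tailed idea concrete, here is a rough Python sketch of where a p-value like that comes from: it is the area in both tails of the t distribution beyond the T value. This is not how the spreadsheet computes it internally. The function names `t_pdf` and `two_tailed_p` are hypothetical, and the area is approximated with Simpson's rule using only the standard library.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=10_000):
    """Approximate two-tailed p-value: total area beyond -|t| and +|t|.

    Integrates the density from 0 to |t| with Simpson's rule (steps must be
    even), doubles it to cover both sides, and subtracts from 1.
    """
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (total * h / 3)
    return 1 - central_area


# At the critical value from the t-table, the two tails add up to about 0.05.
print(round(two_tailed_p(2.042, 30), 3))   # prints: 0.05

# The rounded T value of 2.3 lands between 0.05 and 0.02, the same ballpark
# as the spreadsheet's 0.026, which was computed from the unrounded raw data.
print(0.02 < two_tailed_p(2.3, 30) < 0.05)  # prints: True
```

Halving the result gives the area in a single tail, which is what a one-tailed test would use.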
You could also run a one-tailed test if you’re sure of the directionality, but you have to be cautious when you’re running it.

There are a few assumptions you have to make when you’re running a t-test. Number one, we should have a normal distribution in both the population and the sample, but it works really well with a small sample size. We also should have similar variances in each of those samples. And when we’re looking at the data points, we should have roughly the same number of data points in either sample. And then finally, this works well with low numbers, but you generally want to be in the 20 to 30 range when we’re looking at samples. If we go much higher than that, instead of using a t-test we’d actually use a z-test.

So did you learn everything I showed you? Well, now’s a chance to practice it. I’ve got a sample set over on the left side. Imagine we have plants from population A and population B, and let’s say we’re looking at the leaves that each of those plants has. Is there a statistically significant difference between those in plant A and plant B? So you should run a t-test. I’ll put a link to an Excel file down below. And then, once you figure it out... What are you trying to figure out again? Do we not reject, or do we reject, the null hypothesis? I’d love to know what you think. Put that in the comments down below. And I hope that was helpful.