Transcript:

Let’s fit a model to some data. These are the annual temperatures for the last hundred and twenty years in a fictional Midwestern town. There’s one point per year: the annual median of the daily high temperatures. When we look at it, our eye is really good at pulling out a pattern. There’s a clear lift toward the right-hand side. We’d like to capture that in a model.

There are a lot of models that can represent this, but a really nice starting point, because it’s so simple, is a straight line. Here’s what the best-fit straight line looks like. It does a pretty good job. We can see that it definitely captures the upward tilt of the data, but it doesn’t capture the bend in it. It’s clear when we examine it that a straight line doesn’t do quite as well as we would like.

Luckily, we have a lot of other options. A reasonable next candidate is a quadratic, a polynomial with a squared term in addition to the linear term. These have some curvature to them. We can see that the best-fit quadratic clearly captures the lift at the right-hand side of the plot and the bend in the middle, but it also imposes a little bit of lift on the left-hand side of the plot, which is not obviously reflected in the data. So we can try other options. We can try polynomials with cubic terms, powers of three, or we can look at polynomials with quartic terms, powers of four. We can also fit polynomial models of order five, polynomials of order six, seventh-order polynomials, and eighth-order polynomials, also called octic polynomials, a useful topic for filling lulls in conversations at parties.

Now the fit appears to be getting better, but the line is taking on extra personality. It’s adopting some wiggles. If we take this to an extreme, we can imagine a model that passes through every single data point perfectly. This model would have zero error, zero deviation from our measured data. So does that make it the best-fit model?

Models are useful because they allow us to generalize from one situation to another. When we use a model, we’re working under the assumption that there is some underlying pattern we want to measure, but it has some error on top of it. The goal of a good model is to look through the error and find the pattern. The most common way to check whether a model does this is to split our data up into two groups. We can use one group to train our model, and then we can test it to see how closely it fits the second group. The first group is the training data set. The second group is the testing data set. There are lots of ways to do this, and we’ll revisit them later, but for now we’ll randomly sort our years into two bins. We’ll put 70% of them into the training data set and 30% of them into the testing data set.

Then we can go back to our collection of model candidates and try them one by one. Here are a few of the models, trained on the training data and plotted against the testing data. As the models get to be higher order, we can see that the wiggles they developed may have been helpful for fitting the training data but don’t necessarily help them fit the testing data better. We can see an extreme example of this in the full interpolation model, where we just connect all the training data points with straight lines. It really struggles to match the testing data points.

It’s helpful to look at the error on the training and testing data sets for each model, lined up side by side. Looking at the errors on the training data set, a few things jump right out. First is the wide gap between the training errors, the hollow circles, and the testing errors, the solid circles. Right away, we can see that there’s a substantial difference between the two data sets. Second, there’s a precipitous drop in error going from a linear to a quadratic model, that is, from a first- to a second-order polynomial. This makes sense.
When we were eyeballing it, we could see that the linear fit failed to capture the curvature of the data, one of its most prominent features. The quadratic curve captured that just fine.

So which model fits best? When we look carefully at the errors on the training data, it appears that the error on the fifth-order polynomial is the lowest. The differences are subtle, so you might have to squint. All the other higher-order models have low error too; they’re just a little higher than the fifth-order polynomial. But as we mentioned, that’s not the ultimate test. The error on the testing data is what we really care about.

Careful inspection of the testing error shows that the fourth-order model does the best job. At higher orders of polynomials, the error on the testing data set goes up. The more wiggly the line gets in fifth- and higher-order polynomial models, the more it captures the quirks of the training data rather than the underlying pattern of the testing data that we’re interested in.

Based on this train-and-test approach, we have a clear winner. Of all the models we tried, the fourth-order polynomial is best. Congratulations to us! We chose a pretty good model for our data. But don’t leave just yet. There are some pretty important ideas still to mention. Join me for part two, where we’ll talk more in depth about what we want in a model.
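
For anyone who wants to try the train-and-test comparison described above, here is a minimal sketch in Python with NumPy. It is not the code behind the plots in the video. The temperature data is made up (a hypothetical cubic trend plus random noise), the function names are invented for illustration, and root-mean-square error is an assumed choice, since the narration doesn’t say which error measure was used.

```python
import numpy as np
from numpy.polynomial import Polynomial


def make_synthetic_temperatures(n_years=120, seed=0):
    """Made-up stand-in for the town's data: a gentle curve plus noise."""
    rng = np.random.default_rng(seed)
    years = np.arange(1900, 1900 + n_years)
    t = (years - years[0]) / n_years  # rescale years to [0, 1)
    pattern = 58.0 + 3.0 * t**3  # hypothetical underlying pattern
    temps = pattern + rng.normal(0.0, 1.0, size=n_years)  # error on top of it
    return years, temps


def train_test_split(years, temps, train_fraction=0.7, seed=1):
    """Randomly sort the years into a training bin and a testing bin."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(years))
    n_train = int(round(train_fraction * len(years)))
    train, test = shuffled[:n_train], shuffled[n_train:]
    return (years[train], temps[train]), (years[test], temps[test])


def rms_error(model, x, y):
    """Root-mean-square deviation between the model and the measurements."""
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))


def compare_orders(train, test, orders=range(1, 9)):
    """Fit each polynomial order on the training data, score it on both sets."""
    results = {}
    for order in orders:
        # Polynomial.fit rescales x internally, which keeps high orders stable.
        model = Polynomial.fit(train[0], train[1], deg=order)
        results[order] = (rms_error(model, *train), rms_error(model, *test))
    return results


years, temps = make_synthetic_temperatures()
train, test = train_test_split(years, temps)
errors = compare_orders(train, test)

# The winner is the order with the lowest error on the testing data.
best_order = min(errors, key=lambda order: errors[order][1])

for order, (train_err, test_err) in errors.items():
    print(f"order {order}: train {train_err:.3f}  test {test_err:.3f}")
print("lowest testing error at order", best_order)
```

Because the data here is invented, the winning order won’t necessarily be four. Also, with a least-squares fit scored by root-mean-square error, the training error should not rise as the order goes up, so the training column won’t mirror the plot described above exactly. The testing column is the one to watch.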