Transcript:

So in this video, we’re going to talk about model selection. This gets to the discussion we had in the previous video about model complexity and whether a model is parametric or nonparametric. In either case, machine learning algorithms usually have the ability to tune the desired complexity of the model. So even though a nonparametric model can have arbitrary complexity, you still might be able to control what the algorithm is actually going to output. Even though the hypothesis space H is really big and contains really complex models, you can still probably have some settings that make the learning algorithm prefer lower-complexity models. So then the question is how complex we want these models to be. To understand that, we’re going to have to think about ideas like overfitting and underfitting, we’re going to have to think about generalization error, and we’re going to have to think about validation as a procedure for model selection.
So here are some scenarios you might think about. Suppose you have some training data and you run machine learning algorithm 1. This is something where you don’t know exactly how it works. You just downloaded it off the Internet, and people say it’s a really great algorithm, so you want to try it on your data. You run this algorithm and you get predictor 1, the predictor that gets output by the algorithm. Then you want to evaluate how good this predictor is at the task you trained it for, so you run it on the training data and you get about 0.2 accuracy. So it did really badly. Or, well, actually, we don’t know. Maybe that’s relatively good. Okay, so that’s fine, and then you go to deploy it. Your boss says, OK, I need this machine learning algorithm up and running on the website right now. So you go to deploy it, and you get real-world data.
You run it on the real-world data and you get 0.25 accuracy. OK, so that’s not that bad. Maybe 0.25 is good: if you’re trying to classify among a hundred different classes, random guessing would get something like 0.01 accuracy, so 0.25 might be pretty good. But let’s say you also try a different algorithm. You downloaded this software package with another machine learning algorithm that people say is really great. So again, you take some training data and put it into machine learning algorithm 2, and you get predictor 2. Then you go back and again evaluate it on what you trained on, to see how well it learned what it was trying to learn. You evaluate it on the training data and you get 0.95 accuracy, or 95% accuracy. That’s a huge number; that’s almost perfect on the training data. So you’re really excited and you go to deploy it to the real world. You tell your boss, get ready to watch this, and you hit go. You deploy it, and it gets you 0.27 accuracy, and everyone laughs at you.
Okay, so what’s happening here? This is an example of the difference between accuracy on the data you’re training on and accuracy in the real world, or the actual risk, which we talked about a few videos ago. That difference is the generalization error. In this case we had a huge generalization error: the amount of error in our prediction of how well we’ll do is 0.68, if I did my subtraction right, which is huge. So what’s the difference between these two algorithms? Again, this is just a wholly imaginary cartoon example, but let’s say we want something around 0.5 accuracy or higher, and let’s say 0.25 is actually not that good.
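To make the subtraction in the story concrete, here is a minimal sketch. The numbers come from the cartoon example above; the variable names are only illustrative.

```python
# Numbers from the cartoon example; the variable names are illustrative.
train_acc_2 = 0.95    # predictor 2 evaluated on its own training data
deploy_acc_2 = 0.27   # predictor 2 evaluated on real-world data

# Gap between what the training accuracy suggested and what we actually got.
gap_2 = train_acc_2 - deploy_acc_2

# Baseline for the hypothetical 100-class problem: uniform random guessing.
num_classes = 100
chance_acc = 1 / num_classes

print(f"gap for predictor 2: {gap_2:.2f}")
print(f"chance accuracy with {num_classes} classes: {chance_acc:.2f}")
```

The chance baseline assumes the hundred classes are equally likely, which is what the random-guessing remark in the story implies.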
So if that’s the case, what’s going wrong with each of these algorithms? The first one is underfitting and the second one is overfitting. When it’s underfitting, what that means is that the model, or the machine learning algorithm, is doing a bad job of fitting the training data. Remember, it was getting only something like a quarter of the examples correct, and we should expect that if the model were the right representation of the data, you would be able to do better than that.
Some of the reasons that can lead to underfitting: the model may be too low-dimensional. This could be a situation where we designed a mathematical model that just doesn’t have enough knobs for the machine learning algorithm to tune to get a good fit. Or it could be that the model is heavily regularized. What that means is, as I mentioned at the beginning of this video, machine learning algorithms generally have a tuning parameter that lets the user select how complex the model is allowed to be. This is to trade off between the two problems we’re showing right here. If you tune the regularization too high, you’re forcing the machine learning algorithm to produce an overly simple model, and when you do that, sometimes you underfit. And then lastly, even if you have that parameter tuned right and you have the right dimensionality, your modeling assumptions may be totally wrong and not really reflect the real world, or whatever is generating your data. Then no matter how good your learning algorithm is, if the space of hypotheses doesn’t even include a hypothesis that accurately models the data, you’re going to underfit.
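One way to see the "regularization tuned too high" case is a toy example. The sketch below assumes a specific kind of regularization that the video does not name: a one-dimensional ridge (squared-weight) penalty with no intercept. The function names `fit_ridge_1d` and `mse` are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2 (assumed ridge penalty)."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + lam
    return numerator / denominator


def mse(xs, ys, w):
    """Mean squared error of the predictor x -> w*x on the given data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy training data that a line through the origin fits perfectly: y = 3x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

w_light = fit_ridge_1d(xs, ys, lam=0.0)   # no regularization
w_heavy = fit_ridge_1d(xs, ys, lam=1e6)   # regularization turned far too high

print("light:", w_light, mse(xs, ys, w_light))
print("heavy:", w_heavy, mse(xs, ys, w_heavy))
```

With the heavy penalty the weight is pushed toward zero, so the predictor is too simple to fit even the training data, which is the underfitting situation described above.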
So then, on the other end of the spectrum is overfitting. We saw in the previous cartoon example that we were getting 0.95 accuracy, so it was doing almost perfectly on the training data. In the absolute extreme, you could imagine a learning algorithm that just memorizes all the data. It memorizes all of your examples from the training set, and it will get 100% accuracy, because it has memorized every example. When you go evaluate the hypothesis that it output on your training data, it’s just going to match one-to-one everything it memorized. But now, if it gets a new data example in the real world, when you go to deploy it or evaluate it on some test data, there might be a data point that’s very similar to one of the data points it memorized but is not exactly the same, and this memorization learning algorithm won’t know what to do with it. So that’s an example of overfitting.
It happens a lot in practice when you have high-dimensional models, where there are a lot of parameters for the learning algorithm to optimize over or tune, or when the model is nonparametric. And conversely to the underfitting example, it might be that you’re regularizing the model too weakly. You, as the machine learning expert, could have turned this knob to be more complex or less complex; you set it to be super complex, and that made it memorize the data. Another possible cause of overfitting is that you’re not making enough modeling assumptions. This is where you say, well,
Anything is possible, so let the machine learning algorithm do whatever it wants. Then the space of all hypotheses is way too big and way too nuanced, and the learning algorithm is again able to learn a model that does too well on the training data but doesn’t generalize. And then lastly, sometimes this can happen if you don’t have enough data. All the other things I talked about (very high dimensionality, weak regularization, not having enough modeling assumptions) can often be remedied if you just have enough data. But if you don’t have enough data, then you need stronger modeling assumptions, stronger regularization, and maybe a lower-dimensional representation.
Another example of this phenomenon is fitting trends to numerical data. A lot of times you see people look at numerical data and it looks kind of noisy. Let’s say you were plotting this totally imaginary data on two axes. I didn’t even label the axes because they don’t matter; it’s just numbers. As a human, you can look at it and sort of see a trend. If you plot this kind of data in MATLAB, it actually gives you a bunch of options for data fitting, and one of the more powerful options is something called shape-preserving interpolation. I’m not exactly sure what it does precisely, but you can see it’s interpolating between points, and when points are too close together, it’s taking kind of an average of multiple points. You get this very bumpy curve that pretty much lines up with all the points, or intersects almost all the points, except where things are really dense. Suppose you showed scientists this plot and said, well, look, I found the function that represents all this data that you collected.
They probably wouldn’t believe you. Just as humans, with our intuition, we can see that this shape-preserving interpolation method is basically trying too hard to fit the data exactly, and this probably doesn’t describe an actual trend in reality. So the green line, the shape-preserving interpolation, is an example of overfitting.
On the converse, the red line that I’ve just added to the screen is a linear fit. This is the line of best fit. One of the most canonical and most used methods of analyzing data in science is to do a linear regression, and the red line is the linear regression on this data. Again, human intuition is going to allow you to look at this and say, well, this doesn’t look right. Here’s what happens a lot of times in some of the less proud moments in science, where a lot of scientific results have been, let’s say, inaccurately reported. You plot something like this, you do a linear regression, and you say, oh, well, there’s a trend going up to the right, and maybe that means some dietary trick leads to, I don’t know, a reduction in disease. And of course, if you look at the data, you see that there’s no actual trend like that here.
What you realize is that you can do something in between the two, a little more powerful than the linear fit, which is underfitting, and a little less powerful than the interpolation, which is overfitting. You can do a quadratic fit, and it looks pretty good. And it turns out that I actually generated this data by taking a noisy quadratic function.
So I cheated by setting up this problem so that I knew exactly the model class we were looking for. Again, this is all based on human intuition: you can just look at these images and figure out what we think fits better with the eyeball test. But how do we do this more systematically? We can’t always visualize everything. Some of these things are in high dimensions, so they’re hard to visualize, and we can’t always depend on visualization. So what’s the procedure for us to choose between, let’s say, these three models?
To think about this, let’s switch gears and think about the classification problem again. It’s kind of easier to think about, at least for me. I mentioned the nearest neighbor classifier as an example of a nonparametric model. It turns out that the kind of crazy example I gave previously, of a machine learning algorithm that just memorizes the data, is exactly what nearest neighbor classification is. Well, not exactly, but that’s the foundation of nearest neighbor classification. What it does is store all the data that it sees and remember what label each data point had. When it gets a new data point, it decides which stored example this new input is most similar to and predicts that example’s label. What you’ll find a lot of times if you use nearest neighbor classification is that you’re definitely going to overfit in some sense, at least for pure nearest neighbor. You can make this less powerful if you do k-nearest neighbors, and that actually helps you reduce overfitting. But suppose you just find the single nearest neighbor.
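The store-everything behavior of a single-nearest-neighbor classifier can be sketched in a few lines. This is a toy version under stated assumptions: the inputs are single numbers rather than digit images, similarity is absolute difference, and the name `nearest_neighbor_predict` is hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the stored example closest to `query`.

    Assumes 1-D numeric inputs and absolute difference as the distance.
    """
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best]


# "Training" is just storing the examples and their labels.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = ["a", "b", "a", "b", "a", "b"]

# Evaluate on the same data we stored: every point is its own nearest neighbor.
correct = sum(
    nearest_neighbor_predict(train_x, train_y, x) == y
    for x, y in zip(train_x, train_y)
)
train_acc = correct / len(train_x)
print("training accuracy:", train_acc)

# A new input that was never stored gets the label of whatever is closest.
print("prediction for 2.4:", nearest_neighbor_predict(train_x, train_y, 2.4))
```

Because each stored point is at distance zero from itself, the training accuracy says nothing about how the classifier behaves on inputs it has not seen.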
With a single nearest neighbor, you always get 100% training accuracy, because the nearest possible example is always going to be the true example, the original example. But let’s say you get a new five, a new input that looks kind of like a five, and that’s not actually in your dataset. It’s going to try to match it up to things that look kind of like it, like fives and sixes. Maybe it looks like an eight. Maybe it looks like a three. So a lot of times you see something like this, where you get a much lower testing accuracy than training accuracy.
One way for us to not be shocked by this when we go to deploy the models we learned from our machine learning algorithm is to do held-out validation. That’s where we say, all right, let’s take our training set and sacrifice some of it. We originally had, in this case, 50 points, 50 examples of digits. Let’s take 10 of those digits and set them aside. Then we’re going to train on the 40 digits that are left over and never look at the held-out data, those last 10 digits. We can try different models, training each on the remaining data and then evaluating it on the held-out data. What we’re trying to do here is simulate the experience we would have had if we were to train on all our data, go and deploy it, and bank all of our money on some untested algorithm. In this case, we can test it in the simulation environment. What we might typically see is something that looks like this. We usually see a simple model that does reasonably well on the training data and does kind of badly on the validation data.
We might see a medium-complexity model that does even better on the training data and also does better on the validation data. Then we might see a complex model that does really, really well on the training data but does worse on the validation data. And lastly, you might see a super complex model where we can just completely memorize the training data, and we do terribly on the validation data. So that’s the usual pattern you’ll see when you go to do this.
This is also not super robust, because we’re depending on the one set that we held out, that validation data on the right, to be representative of what our experience will be when we go and deploy the machine-learned hypothesis in our real-world system. So instead, we can repeat this process over and over again with different folds, as they’re known. We’ll hold out the first ten digits and train on the rest of them, then we’ll hold out the second ten and train on the rest, hold out the third and train on the rest, et cetera. This way, at some point we will be using every data point as a validation example. You can choose how many folds you want to do. You might only do two if you’re lazy, or if you really want to be exhaustive, you can do N folds for N data points, where you leave out only one example. You train on the other N minus 1 examples and you evaluate on that one example, so your accuracy at that point will be either 1.0 or 0. But then you keep doing that for all the different examples, and you take the average, or some statistic of how often you’re correct. That’s leave-one-out cross-validation. At this point it’s probably useful to think about the choices I’ve mentioned.
I showed examples of five folds, I showed N folds for leave-one-out, and I said you could do two folds. So how many folds should you do? You should think about what the pros and cons are of leave-one-out cross-validation, and relatedly, why we would be interested in using fewer or more folds. Then secondarily, another question you might wonder about, and that we should discuss in class: what I’ve described so far is that we’re going to take N folds, hold out one of them, train on N minus 1 folds, and then evaluate on that one held-out fold. That gets us a small fraction of evaluation points and a large fraction of training points. I presented it this way because this is usually how people do it, but there may also be advantages to doing it the other way, where you train on one fold and you test on N minus 1 folds. So let’s think about what the pros and cons are of that distinction. You might want to switch things.
Now, I’ve been giving this story where you’re working in some industrial application and you want to deploy the predictor, or the model, that you learned from your machine learning algorithm in practice. But when you’re doing scientific research, or research on machine learning itself, usually what happens is you have a big data set and you just want to evaluate how well your machine learning algorithm does on this data set. The best practice is that you put some of the data into a test set that is completely hidden from training, and then you use validation, or cross-validation, or leave-one-out validation, on the remaining data. So there are actually at least three splits.
There’s the training data, there’s the validation data, and then there’s the test data, and you usually rotate what the training and validation sets are. This way, you have this set of testing data that you never touch with your training process. Now, sometimes people will just report a cross-validation score to show how well their machine learning algorithm does, but that’s not the ideal thing to do, in particular when you’re using cross-validation to decide between models or to tune a particular model parameter, like a complexity parameter.
Okay, so in class we’re going to think about a couple of different scenarios. These are situations that happen every now and then in research, or when you try to run a machine learning algorithm, and they’re kind of puzzling, so it’s important for us to think about them, discuss them together, and try to understand what it means when these things happen. The first scenario I want you to think about is what happens if you run cross-validation on different values of some complexity parameter. You’re choosing how complex you want your model to be, and it turns out that as you tune that complexity parameter, you get really erratic scores. Like in this plot here: for one setting you get 0.8, for another setting you get 0.9, then you increase the parameter a little bit and you get something like 0.2, and then it goes back up to 0.9. You see some really, really weird behavior. So let’s think about what it means when that happens. The second scenario is when the cross-validation score is nearly uniform: you tune your complexity parameter, and your score from cross-validation just seems to be basically the same.
That holds no matter what you set that parameter to. So what do these things mean? Are there remedies for these situations, or is this a sign that something has gone wrong?
Okay, so to quickly summarize: validation is this fundamental idea in the methodology of machine learning, which allows you to measure performance by holding out some of the available data. The goal is basically to simulate the testing environment, to simulate the original process I showed you, where you train your machine learning model on your training data and then just send it out into the real world and hope it doesn’t bankrupt your company. The idea is that you’re going to rotate folds of held-out subsets, and you can even hold out one example at a time, which is leave-one-out validation. And you can use this cross-validation on your training data to tune extra parameters.
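As a rough sketch of the whole procedure summarized above, the code below rotates folds to pick a complexity parameter and only then touches a hidden test set. Everything specific here is an assumption made for illustration: the synthetic 1-D data with noisy labels, k-nearest neighbors as the model, the candidate values of k, and the function names `knn_predict`, `make_folds`, and `cross_val_accuracy`.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k stored (x, label) pairs closest to `query`."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def make_folds(n, num_folds):
    """Split indices 0..n-1 into `num_folds` disjoint folds of near-equal size."""
    return [list(range(i, n, num_folds)) for i in range(num_folds)]


def cross_val_accuracy(data, k, num_folds):
    """Average validation accuracy when each fold is held out once."""
    fold_scores = []
    for fold in make_folds(len(data), num_folds):
        held_out = set(fold)
        train = [p for i, p in enumerate(data) if i not in held_out]
        val = [data[i] for i in fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in val)
        fold_scores.append(correct / len(val))
    return sum(fold_scores) / len(fold_scores)


def sample(n, rng):
    """Synthetic data (an assumption): label is 1 when x > 0.5, flipped 20% of the time."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        label = int(x > 0.5)
        if rng.random() < 0.2:
            label = 1 - label
        points.append((x, label))
    return points


rng = random.Random(0)
data = sample(60, rng)
test_set, dev_set = data[:10], data[10:]   # the test set stays hidden during tuning

# Tune the complexity parameter with 5 folds: hold out 10 points, train on 40.
candidates = [1, 3, 5, 9]                  # odd values avoid ties between two labels
scores = {k: cross_val_accuracy(dev_set, k, num_folds=5) for k in candidates}
best_k = max(scores, key=scores.get)

# Leave-one-out is the same loop with as many folds as data points.
loo_acc = cross_val_accuracy(dev_set, best_k, num_folds=len(dev_set))

# Only after choosing best_k do we evaluate once on the hidden test set.
correct = sum(knn_predict(dev_set, x, best_k) == y for x, y in test_set)
test_acc = correct / len(test_set)

print("cross-validation scores:", scores)
print("chosen k:", best_k, "| leave-one-out:", loo_acc, "| test:", test_acc)
```

Here smaller k means a more complex model, since k = 1 memorizes the data. The cross-validation scores are used to choose k, and the test accuracy is the number you would report, because the test points played no part in that choice.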