Transcript:
Overfitting is a very common issue in machine learning, and L1 and L2 regularization are two of the techniques that can be used to address it. In this video we'll go over some theory on what exactly L1 and L2 regularization are, and then we'll write Python code and see how a model that is overfit can be fixed, and its accuracy improved, when you use L1 and L2 regularization. We'll be using a housing price dataset from the city of Melbourne. We'll first build a model using simple linear regression and see that it overfits, and then we'll apply L1 and L2 regularization and see how that addresses the overfitting issue and improves the score on our test set. So let's get started.

Let's say you're trying to predict the number of matches won based on age. Usually, as a sportsperson or athlete ages, the number of matches won tends to decrease, so you can have this kind of distribution. To build a model, you can create a simple linear regression model, and the equation might look like this: matches_won = θ0 + θ1 · age, where θ0 and θ1 are just constants. This is a regular simple linear equation, but you see that this line does not accurately describe all the data points. It finds the best fit in terms of a straight line, but all these data points drift away from it, and if you have test data points lying somewhere over here, then this is not a very accurate representation of our data distribution.

Alternatively, you can build a model that might look like this, where we try to draw a line that passes exactly through all our data points. In that case the equation is a higher-order polynomial where you're trying to predict matches won based on a person's age. But here the issue is that this equation is really complicated: the line is a zigzag that just passes through all the data points, and if you have some new data points at the top here, again it does not generalize the distribution well.

What might be better is a line like this, which is a balance between the two cases we just saw: you keep only terms up to θ2 · age², the line looks like a curve, and it generalizes your data really well, so that tomorrow, if a new data point comes in, this equation will make a better prediction for you. The first case is called underfitting, the second case is called overfitting, and the third case is a balanced fit. So you get an idea of overfitting here: if you train too much and fit too closely to your training dataset, then you will have issues with your test dataset; when you try to predict new data points, it might not make good predictions, so you always have to keep a balance between these two extreme cases. If you don't know about all these equations, please refer to my linear regression tutorials; in the same playlist I posted a few videos on linear regression, and those are a prerequisite for this one.
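To make the three cases concrete, here is a minimal sketch (not from the video; the data here is invented purely for illustration) that fits polynomials of degree 1, 2, and 10 to an age-versus-matches-won scatter:

```python
# Underfit vs. balanced vs. overfit, illustrated with made-up data.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
age = np.sort(rng.uniform(20, 40, 15)).reshape(-1, 1)
# Matches won decline with age along a curve, plus noise (hypothetical relationship).
matches_won = 100 - 0.08 * (age.ravel() - 20) ** 2 + rng.normal(0, 3, 15)

for degree in (1, 2, 10):  # underfit, balanced, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(age, matches_won)
    print(f"degree {degree}: training R^2 = {model.score(age, matches_won):.3f}")
```

The degree-10 model will show the highest training score precisely because it snakes through the points; on a held-out test set the degree-2 curve would typically win, which is the balance described above.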
Now, how do you reduce overfitting? Here is my overfit line along with its equation, and if in this equation I somehow make sure that θ3 and θ4 are almost zero, then I get an equation like this. Just think about it: if θ3 and θ4 are close to zero, you're effectively ruling out those whole terms, and you end up with this simpler type of equation. So the idea here is to shrink your parameters. If you can keep the parameters θ1, θ2, θ3, θ4 small, then you get a better equation for your prediction function.

Now, how do we do that? We saw earlier in our linear regression video that we calculate the mean squared error: when we run training, we pass each sample, calculate the predicted y using some randomly initialized weights, and compare it with the truth value; this is how we calculate the mean squared error, or MSE. Here the predicted y is actually h_θ(x_i), where h_θ(x_i) could be a higher-order polynomial equation like this, and x1, x2 are nothing but your features. In our case that would be the age of a person; if you're thinking about housing price prediction, it would be the size of the house.

By the way, we use this mean squared error function during training, and we want to minimize its value on each iteration. Now, in this equation, what if I add this particular term? So what is this? There is this λ, which is a free parameter you can control, like a tuning knob, and you take the square of each of the θ parameters. Now, if a θ gets bigger, this term gets bigger, the error gets bigger, and your model will not converge, so essentially what you're doing is penalizing high values of θ. Whenever the model tries to make a θ value higher, you add a penalty, and by adding this penalty you make sure the θ values don't go too high; they remain small. You can fine-tune this with the λ parameter: if you make λ bigger, the θ values get even smaller, and if you make λ smaller, the θ values can be bigger. This is called L2 regularization. It is called L2 because we use a square, whereas in L1 regularization you use the absolute value; that is the only difference between L1 and L2, that in L1 you use the absolute values of the θ parameters. Again, if θ is bigger, the overall error is bigger, and it acts as a penalty so that during training the θ values remain small. Going back to the earlier equation: when these values remain small, you come up with a simpler equation. You don't make it really complicated, and simpler equations are the best at representing the generic case for your prediction.
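In code form, the two penalized error functions described above can be sketched like this (a minimal NumPy illustration of the formulas, not the actual training loop from the video; whether the intercept θ0 is included in the penalty varies by convention):

```python
import numpy as np

def mse(y_true, y_pred):
    # Plain mean squared error, as in the earlier linear regression video.
    return np.mean((y_true - y_pred) ** 2)

def l2_cost(y_true, y_pred, theta, lam):
    # L2 regularization: add lambda times the sum of squared parameters.
    return mse(y_true, y_pred) + lam * np.sum(theta ** 2)

def l1_cost(y_true, y_pred, theta, lam):
    # L1 regularization: add lambda times the sum of absolute parameters.
    return mse(y_true, y_pred) + lam * np.sum(np.abs(theta))
```

A larger lam pushes the optimizer toward smaller θ values and a simpler curve; a smaller lam lets the parameters, and the curve's wiggles, grow. That is exactly the tuning-knob behavior described above.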
All right, so let's get into the coding now. For the coding, I'm using the housing price dataset for Melbourne city, which I got from Kaggle, and we're going to build a linear regression model. You can see that there are different features such as rooms, distance, postcode, bathroom, car and so on. In my notebook, as usual, I first import this dataset and just call it dataset, so I imported it into my data frame, and the data frame is looking good.

I'm going to do some exploration of my data frame now and print out the unique values in the dataset; you see there are 351 suburbs, this many addresses, and so on. Also, if you look at the shape of the dataset, there are 34,857 records and 21 columns in total.

Now I'm going to discard certain columns. I just did a visual inspection and discarded certain columns that I don't think are very useful; for example, the date is not useful. So I'll just say, okay, here are the columns that are useful, and I'm just doing copy-paste to save time while recording. When you pass these columns to this dataset, you get the filtered dataset, and then again you can run dataset.head(). Now I have fewer columns: if you do dataset.shape, I have 15 columns instead of 21.

Now I want to check for NaN values. You can do that by calling the isna() function on your pandas data frame followed by .sum(), and it will tell you, for example, that Bedroom2 has a total of 8,217 NaN values, that is, NaN rows. Now we need to handle these rows, so I'm going to fill some of these columns with the value 0, and those columns are these. The columns to fill with 0 are, for example, Car: if Car is NaN, that means there is no car parking available for that particular property. When you run this function, you're saying to your dataset, for all these columns, fill NaN with zero; it will take those NaN values and fill them with zero. After doing this, I run the check again, and you can see that, for example, Propertycount, which had three NaN values, now has zero, and all its NaN values are filled. Similarly for car parking: there were about 8,700 rows with NaN values, and now it is zero.

Now what we're going to do is look at certain other columns such as Landsize and BuildingArea, and for these we'll calculate the mean and fill the NaN values with the mean value. The way you do that is with this function: you do Landsize fillna with the mean of the same column. This is a safe assumption, and after that, when you check for NaN again, you find that the independent features are now basically curated. There are still some NaN prices, and there are only two other columns left, Regionname and CouncilArea, so I'm not going to worry about those too much; I will just drop those rows. If you have a handful of random rows with NaN values, you can drop them: our dataset is huge, and dropping a few rows like these is not a big deal. So I drop them, run the same check again, and you see that none of the columns have NaN values now.

Now I have some categorical features that I want to convert into dummies; that is, I want to do one-hot encoding. You might know about one-hot encoding, and if you don't, I have a one-hot encoding video. For things like Suburb, for example, or any text column that you have, you need to do dummy encoding, and pandas provides a very convenient API called get_dummies. You call it on the dataset, and you pass drop_first because you want to avoid the dummy variable trap; again, if you don't know about the dummy variable trap, you can watch my one-hot encoding video. This will drop the first column produced by the encoding, as the sketch below shows.
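Pieced together, the cleaning steps described above look roughly like this (a sketch that assumes the Kaggle file name and the usual column names of the Melbourne dataset; the exact column list kept in the video may differ slightly):

```python
import pandas as pd

# Hypothetical file name; point this at your own Kaggle download.
dataset = pd.read_csv("Melbourne_housing_FULL.csv")

# Keep only the columns judged useful by visual inspection (assumed list).
cols_to_use = ["Suburb", "Rooms", "Type", "Method", "SellerG", "Regionname",
               "Propertycount", "Distance", "CouncilArea", "Bedroom2",
               "Bathroom", "Car", "Landsize", "BuildingArea", "Price"]
dataset = dataset[cols_to_use]

# NaN in these columns plausibly means "none", so fill with 0.
cols_to_fill_zero = ["Propertycount", "Distance", "Bedroom2", "Bathroom", "Car"]
dataset[cols_to_fill_zero] = dataset[cols_to_fill_zero].fillna(0)

# Fill the size columns with their column means.
dataset["Landsize"] = dataset["Landsize"].fillna(dataset["Landsize"].mean())
dataset["BuildingArea"] = dataset["BuildingArea"].fillna(dataset["BuildingArea"].mean())

# Drop the few remaining rows with NaN (this also removes rows with no Price).
dataset.dropna(inplace=True)

# One-hot encode text columns; drop_first avoids the dummy variable trap.
dataset = pd.get_dummies(dataset, drop_first=True)
```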
So now when I look at my dataset, everything is numeric: there is no text column, and if you look at CouncilArea, there is a CouncilArea_ column for each council area, so it created a separate column for each one. This is what dummy, or one-hot, encoding means.

Now I want to create X and y. y is basically Price, and X is everything else. By the way, I had some prices as NaN, and when I did dropna, those NaN-price rows got dropped as well, so my dataset looks pretty good. Now I'll do the train-test split. The train-test split is this; all of these are standard, usual methods that we covered in previous videos, which is why I'm going over them a little fast. We're doing a 30 percent split, so 30 percent test and 70 percent training, and you get all these different datasets.

Now we'll use regular linear regression. This is how you do regular linear regression, and when you run this, it does the training and the model is fit. Now I'll compute the score on X_test and y_test, and you realize that my score comes out to be really low, 14 percent, which is a very, very low score. But if you do a score on the training dataset, it is 68 percent, so this is clearly overfitting. The model was so overfit that for training samples it gave good accuracy, but for test samples, the data samples it has not seen before, it gave a horrible score.

So how do we address this? sklearn provides a model called Lasso, and lasso regression is basically L1 regularization: if you look up sklearn's Lasso regression, you'll see that it is L1 regularization, so I'm going to use that model. I imported it, then created a lasso regression object with an alpha value. I'm initializing this alpha value somewhat arbitrarily; you can play with these values and see which one gives you better accuracy. I initialized a few other parameters as well, and when I now run the fit, it trains with the regularization penalty on. If you look at our earlier equation, so let me open it: L1 regularization adds the absolute θ values to your error, and this is the formula used during training; simple linear regression without any regularization will not have this red term, so that's the only difference. Now let's do a score on the test set. I'll call the lasso regression's score, and you find that the accuracy improved; just to make sure, I'll also do a score on the training set, and you see that training and test both give around 67 percent accuracy. Not amazing, but compared to the 14 percent we had before, that is a huge improvement; you can see how much of a difference regularization can make.

There is L2 regularization as well, and it's called ridge regression. If you look up sklearn's Ridge regression, that one is L2 regularization, and I can import it from the sklearn library. As usual, I create a regression object, which looks like this, and call the fit method on it. Then when I do a score on the test dataset, again it is 67 percent, so it's pretty good, and if we check on the training dataset, the training score is also pretty good. So you saw that by using ridge regression and lasso regression, where ridge is L2 and lasso is L1, the accuracy on your unseen data samples, which is your test set, improved a whole lot.
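Put together, the modeling steps look roughly like this (the alpha, max_iter, and tol values are illustrative knobs to tune, not necessarily the exact ones used in the video):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# Price is the target; every other curated column is a feature.
X = dataset.drop("Price", axis=1)
y = dataset["Price"]

# 70 percent training, 30 percent test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# Plain linear regression: high training score, poor test score means overfitting.
lin = LinearRegression().fit(X_train, y_train)
print("linear:", lin.score(X_train, y_train), lin.score(X_test, y_test))

# L1 regularization (lasso).
lasso = Lasso(alpha=50, max_iter=100, tol=0.1).fit(X_train, y_train)
print("lasso :", lasso.score(X_train, y_train), lasso.score(X_test, y_test))

# L2 regularization (ridge).
ridge = Ridge(alpha=50, max_iter=100, tol=0.1).fit(X_train, y_train)
print("ridge :", ridge.score(X_train, y_train), ridge.score(X_test, y_test))
```

Both penalized models should bring the test score up close to the training score, which is the improvement described above.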
If you're trying to learn machine learning, you can just search YouTube for "codebasics machine learning tutorials". I have the complete list of machine learning videos there, and you can go through them in sequence. If you're watching this regularization video and want to get the fundamentals clear first on linear regression and so on, I would suggest you watch tutorials two and three. I hope you liked this video. If you did, please give it a thumbs up, share it with your friends, and thank you very much for watching.