Transcript:

Hi, everybody, welcome to a new machine. Learning from scratch tutorial. Today, we are going to implement the SVM algorithm, using only build and Python modules and numpy, the SVM or support vector machine is a very popular algorithm. It follows the idea to use a linear model and to find a linear decision boundary also called a hyperplane that best separates our data and here the choice as the best Hyperplane is the one that represents the largest separation or the largest march in between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. So if you have a look at this image, then we want to find a hyperplane and the hyperplane has to satisfy this equation. W Times X minus B equals 0 and we want to find the hyperplane so that the distance to both the set to both classes is maximized. So we use the class plus 1 here and minus 1 here, so this distance or the margin should be maximized and first let’s have a look at the math behind it, so it’s a little bit more complex than in my previous tutorials, But I promise that once you have understood it. The final implementation is fairly simple, so we used the linear model. W Times X minus B That should be 0 and then our our function should also satisfy the condition that W Times X minus B should be greater or equal than 1 for our class plus 1 so all the samples here must lie on the left side of this equation or this line here and all the samples of the class. – one must lie on the right side from this equation. So if we put this mathematically, then we should, it must satisfy. W Times X minus B should be greater or equal than one for class one, or it should be less or equal than minus one for class minus one. So if you put this in only one equation, then we multiply our linear function with the class label and this should be greater or equal than one. So this is the condition that we want to satisfy, and now we want to come up with the W and the B So our weights and the bias and for this, we use the cost function and then apply gradient descent. So if you’re not familiar with gradient decent already, then please watch one of my previous tutorials, for example, the one with linear regression there. I explained this a little bit more in detail, so now let’s continue, so we use the user cost function here. And in this case, we use the hinge loss and this is defined as the maximum of zero and one – and here. We have our condition. Y I times our linear model. So what this means is if if we plot the hinge loss then here, the blue line is the hinge loss, so this is either 0 if Y times F is greater or equal than 1 so if they have the the same sign, then it’s 0 and so if they, yeah, if they are correctly classified and are larger than 1 then our loss is zero, so this means if we have a look at this image again if for the green class, if it’s if it lies on this side, then it’s 0 and for the blue class if it lies on this side, then it’s also 0 and otherwise, and then we have a linear function, so the further we are away from our decision boundary line. The higher is our loss, and so this is one part of our cost function and the other part is as I already said we want to maximize the margin here so between these two classes and the margin is defined is 2 over the magnitude of W. So this is dependent from our weight dependent on our weight vector, so we want to maximize this, and therefore we want to minimize the magnitude, so we put this or add this to our cost function, so we also put this term the magnitude of W to the power of 2 times, a lambda parameter, and then here we have our hinge loss, so the Lambda parameter tries to find a trade-off between these two terms so with it says, basically says, which is more important, so we want to. Of course we want to have the right classification. We want to lie on the correct side of our lines, but we also want to have the the line such that the margin is is maximized. So, yeah, so if you look at the two cases, if our if we are on the right side of the lines of why? I times F on X F of X is greater or equal than one. Then we simply we only have this term because this is the hinge loss is 0 and otherwise, then our cost function is this year, and now we want to minimize that, so we want to get the derivatives or the gradients of our cost function, so in the first case, if we are greater or equal than 1 our derivative is only is 2 times Lambda times. W so and here we only look at one component of our W. So we get rid of the magnitude and the derivative with respect to the B is 0 So please double check that for yourself here. I will not explain the derivatives in details and in the other case, So if if Y I times F on X is not greater or equal than 1 then our derivative with respect to the W is this equation here and the derivative, with respect to our bias is only Y I so again. Please double check that for yourself. And then when we have our gradients, we can use the update rule. So the new weight is the old weight – because we use gradient descent, so we go into negative direction, – the learning rate or the step size times the derivative. So these are our update rules and now. I hope you’ve understood the concept and the math behind this. And now we can start implementing it, so this is now straightforward. First of all we import numpy. S&P, of course. And then we create our class as we M which will get an init method and here I will put in a learning rate, which will get a default value of point zero zero one, and it will get a lambda parameter, which will also get a default. And I will say this is point zero one. So this is usually also a small value and then it will get the number of iterations for our optimization, which will get the default of one thousand. So then I will simply store them, so I will say self DOT L. R equals learning rate self. Dot Lambda Param equals lambda Param so note that I cannot use lambda here because Lambda is a key word in Python for the Lambda Function. So ya then self dot and Iters equals and ITERs. Then I will say self DOT W equals nun and self dot B equals nun, So I have to come up with them later, and then we define our two functions so as always, one is the predict function where we fit the training samples and the training labels and the sorry this is the fit method and the other one is the predict method where we predict the labels of the test samples and now let’s start with the predict method Because this is very short, So we want to, as I said. If we look at the math, we apply this linear model, and then we look at the sign of this, so if it’s positive, then we say it’s class one, and if it’s negative, then we say it’s class minus one. So we say linear output equals numpy dot dot, so the dot product of X and self dot W minus self dot B. And then we choose the sign, so we can simply say return Numpy dot sine of this linear output, so this is the whole predict implementation and now let’s continue with the fit method So first of all as I said, we used the classes plus 1 and minus 1 here, so we want to make sure that our. Y has only minus 1 and plus 1 so oftentimes, it has 0 and 1 so let’s convert this, so let’s say. Y underscore equals and here we can use Numpy Dot where this will get a condition, so we say Y, and if this is less or equal than 0 then we put in minus 1 and otherwise we put in plus 1 so this will convert all the zeros or smaller numbers to minus 1 and the other numbers 2 plus 1 and now let’s get the number of samples and the number of features and this is simply X dot shape because our input Vector X is in numpy and D Array, where the number of rows is the number of samples and the number of columns is the none features. Then we want to initialize our W and our B and we simply put in zeros in the beginning, so we say self DOT W equals numpy zeros of size and features so for each feature component we put in a zero for our weight component, and then we say self DOT B equals zero, and now we can start with our gradient descent, so we say for underscore because we don’t need this in range self dot and it error, so the number of iterations we want to do this, and then we iterate over our training samples, so I say for index and X I in enumerate X, So this will give me the current index and also the current sample. And now what I want to do now is lets. Have a look at the math again, so I want to come. I want to calculate the weight or the derivative of our cost function with respect to the W and with respect to the bias And here I first, but at first, I look if this condition is satisfied, so I will say, and the condition is why I times our linear function, so I say condition equals y underscore of the current index times and then the linear function so numpy dot of the current sample and our self. W minus self dot be. This should be greater or equal than one, so if this is satisfied and the condition is true and otherwise, it’s false. So now I say if condition, so if this is true, then our derivatives look like this. So the derivative, with respect to the B is just zero, and so we only need this, so I say, so it’s two times Lambda Times W and then in our update we go in as a we say, the new weight is the old way – the learning rate times this, so I write this in one step, so I say self DOT W – equal self dot learning rate times, and now here we have two times self dot Lambda parameter times self DOT W. So this is the first update or if our condition is satisfied and we only need this update and otherwise we say self dot. W – equals self times L our learning rate times and lets again have a look at the equation, so it’s 2 times Lambda times. W minus y I times X. I so 2 times our lambda times the W – numpy dot. So I want to multiply our vectors X. I and y I so the Y underscore of the current index. So this is our update for the W. And our self dot B is minus equal self times learning rate times the derivative and the derivative is only or just Y I so only Y underscore of the index and now we’re done, so this is the whole implementation, and now let’s test this, so I’ve written a little test script that will import this SVM class, and then it will generate a some test samples so it will generate two classes. And then I will create my SVM classifier and fit the data, and then I wrote a little function to visualize this so you can find the code on Github by the way, so please check that out for yourself, and now if we run this, so let’s say Python as we am underscore test of time and now this should calculate the weights and the bias and it should also plop the decision function so that yellow line and the two lines on both sides here and we see that it’s working so yeah, that’s all about the SVM. I hope you enjoyed this and if you liked this. Please subscribe to my channel and see you next time bye.