Transcript:

What’s up, guys? Elliot Waite here. In this video, I want to talk about the sigmoid function. It’s one of the first functions you learn about when learning machine learning, but for me, it wasn’t until later that I really got a deeper understanding and intuition for why we use the sigmoid function, and in this video I want to try to share that intuition with you.
So first, what is the sigmoid function? Well, really, a sigmoid function is just any mathematical function that has this type of S-shaped curve, but in machine learning, when we say the sigmoid function, what we’re really talking about is this equation here. It’s usually expressed in these two forms. They’re equivalent: you can convert from one to the other just by multiplying the numerator and denominator by e to the x. And the question is, why do we use this equation instead of any of these alternatives, which look like they do the same thing, which is convert numbers on the number line into the range of 0 to 1? Here in this diagram, they’re converting them into the range of negative 1 to 1, but if we just shift these up and squash them down a little bit, they can be used equally well for converting numbers into the range of 0 to 1.
So for the reason why we use the sigmoid function, I’ll give you two different perspectives. One has to do with the loss function and the gradient we’ll get from the loss function, and the other is a statistical viewpoint: thinking about the output from our model before we send it through the sigmoid function, what does that output represent?
So first, to look at the loss function, we’ll use this Desmos online graphing calculator. I’ve mapped these functions over into these colors, and since I don’t have all these colors in Desmos, I’ve mapped brown to black and yellow to orange, but other than that, blue is blue, red is red, purple is purple, and green is green. So first of all, looking at the tanh function: the tanh function is actually the exact same as the sigmoid function, just rescaled and shifted. So in this graph, this green function is the sigmoid function, and if we look at the other functions, they look like they would work just as well at mapping numbers on the number line into the range of 0 to 1.
But if we zoom out and look at the loss function for these different input functions (and for the loss we’re just using the standard categorical loss, the negative log of the probability), we’ll notice a couple of different things. First of all, we’ll notice that these lines are getting cut off, that this orange one is looking a bit weird, and that this black one is starting to get dotted. I’ll explain that in a second, but for now I’m just going to turn off the orange and the black. I’ll turn off these other ones down here as well, and we immediately see that our graph gets a lot cleaner and we can really zoom out.
So the major difference between our sigmoid function here and these blue, red, and purple ones is that as we get farther and farther to the left, the loss function for our sigmoid function stays consistent: we’ll have a consistent gradient pushing us to move to the right. But on these other functions, the gradient, which is the slope of the loss curve, gets smaller and smaller, and the farther we get over here, the harder it will be for our outputs to correct and our model to learn. And if we zoom in to the right, the loss goes towards zero and flattens out, and that’s okay, because that’s what we expect: as our loss goes to zero, we’re already doing the correct thing with that output, so we don’t need much of a gradient. So this saturation aspect of these other functions, with the gradient fading the farther left we go, is one of the main reasons why we don’t use them. But now let’s return to these other potential functions on the left.
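Before moving on, the claims so far can be checked numerically. The following is a minimal sketch, not code from the video: it assumes the two usual forms of the sigmoid are 1 / (1 + e^-x) and e^x / (e^x + 1), checks that tanh is a rescaled and shifted sigmoid, and compares the slope of the loss -log(p) far to the left. The functions `arctan_squash` and `algebraic_squash` are hypothetical stand-ins for "other S-shaped functions" and are not necessarily the curves shown on screen.

```python
import math

def sigmoid(x):
    """First form: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_alt(x):
    """Second form: multiply numerator and denominator by e^x."""
    return math.exp(x) / (math.exp(x) + 1.0)

def arctan_squash(x):
    """Assumed alternative: arctan rescaled from (-pi/2, pi/2) into (0, 1)."""
    return math.atan(x) / math.pi + 0.5

def algebraic_squash(x):
    """Assumed alternative: x / (1 + |x|) rescaled from (-1, 1) into (0, 1)."""
    return 0.5 * (x / (1.0 + abs(x)) + 1.0)

def loss(squash, x):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(squash(x))

def loss_slope(squash, x, h=1e-4):
    """Central-difference estimate of the slope of the loss at x."""
    return (loss(squash, x + h) - loss(squash, x - h)) / (2.0 * h)

# The two forms agree, and tanh(x) = 2 * sigmoid(2x) - 1.
for x in (-3.0, 0.0, 1.3):
    assert math.isclose(sigmoid(x), sigmoid_alt(x), rel_tol=1e-12)
    assert math.isclose(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0, abs_tol=1e-12)

# Slope of the loss as the input moves farther to the left.
print("x      sigmoid   arctan    algebraic")
for x in (-1.0, -5.0, -10.0, -20.0):
    slopes = [loss_slope(f, x) for f in (sigmoid, arctan_squash, algebraic_squash)]
    print(f"{x:6.1f} " + " ".join(f"{s:9.4f}" for s in slopes))
```

In this sketch the sigmoid column settles near -1 as x moves left, while the other two columns shrink towards zero, which is the fading gradient described above.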
I don’t actually know much about the computational needs of these functions, but it’s pretty safe to say they are much more difficult to calculate than the sigmoid function, and we can already see this Desmos calculator is having a hard time calculating them. It can’t really go out that far on this black line, and this orange line is getting a bunch of errors all over the place. So when we’re choosing a function for our neural networks, the easier it is to compute that function, the better choice it’s going to be, and that’s a reason why we wouldn’t want to use these orange and black functions. So that’s the practical side of why we use the sigmoid function, given the derivatives of the loss function.
But now I want to look at the statistical side. I’ll start by showing you an example of where we see the sigmoid function in nature. So speaking of nature, let’s imagine we are out in nature. We’re in a forest, it’s getting foggy, and the fog is starting to condense on the branches and they’re starting to drip. Way up in the air, there are two branches right next to each other, and they’re both dripping drops of water randomly. As a drop falls through the air, it hits all the different air molecules on the way down, and there might be some slight wind left and right. When all these bumps left and right add up, the probability distribution of where that drop will hit the ground can be modeled pretty closely by a normal distribution, and that’s what we have here in this graph. If we look at where the drops would fall for the branch just to the left of it, that could also be modeled by a normal distribution. These graphs are just looking at how far left or right the drop falls, and we’re going to ignore the depth of where it falls, since we’re assuming the branches are directly left and right of each other, so the depth won’t give us any more information about which drop came from which branch.
So let’s say we’re looking at the ground and we see a drop fall right in between the means of these two distributions. If we ask what the probability is that that drop came from the right branch, well, it’s going to be 50%, because both of these normal distributions have the same probability of dropping a drop right in the center. But let’s look a little bit to the right (I’ll pull these a little bit closer for this demonstration), where the blue is twice as tall as the green. If we ask how likely it is that a drop falling here came from the right branch, we take the height of the blue line and divide it by the sum of the heights of the two lines, and that gives us two-thirds. So if a drop falls right here, it has a two-thirds chance of having come from the right branch and a one-third chance of having come from the left branch. And if we do that all along this number line, taking the height of the blue line and dividing it by the sum of the two probabilities at that point, we get, lo and behold, the sigmoid function.
We can look at how it transforms as we move the means of the distributions farther apart and closer together, and changing the variance has a similar squashing effect, but nonetheless it’s still the shape of a sigmoid function. So if we are saying that the output from our model, before we send it through the sigmoid function, is going to be somewhat to the right if it’s a cat image and somewhat to the left if it’s a dog image, but we’re not completely sure where it’s going to land, and it’s going to be somewhat normally distributed, then to say what probability we think it is a cat image versus a dog image, we would use a sigmoid function.
Also note that for the output shape to be this sigmoid shape, the variance of both of the normal distributions has to be the same. If, for example, one of the normal distributions has a greater variance than the other, then it will dominate in the tails of the distribution, like this, as you can see from the curve changing. And assuming equal variances is a sensible thing to do, because we probably don’t want to assume the outputs for one of our classes will have a greater variance than the outputs for another class.
So if you want the mathematical derivation of how we prove this relationship between normal distributions and the sigmoid function, I’ve included it on this page and I’ll link to it in the description if you want to go through it, or you can just pause the video and look at these steps. When we get down to the end, this value can be replaced by a, the squashing value, and this value can be replaced by b, the shifting value. As we move the means around, it changes the shifting and it changes the squashing, but this is essentially the sigmoid function with some squashing and shifting.
So I’ll put a link to this graph if you want to play around with it, and hopefully this video gave you a deeper intuition into why we use the sigmoid function instead of the other potential S-shaped functions. And that’s all for this one. I’ll see you guys next time. [Music]
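As a companion to the two-branch example, here is a minimal sketch (not the derivation page from the video) that checks the relationship numerically. It assumes both branches drip at the same rate, so the chance a drop came from the right branch is the right density divided by the sum of the two densities. Working through the algebra for equal standard deviations gives sigmoid(a * x + b) with a = (mean_right - mean_left) / std^2 and b = (mean_left^2 - mean_right^2) / (2 * std^2). The names `posterior_right` and `squash_and_shift` and all numeric values are illustrative, not from the source.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_right(x, mean_left, std_left, mean_right, std_right):
    """Chance a drop landing at x came from the right branch (equal drip rates assumed)."""
    left = normal_pdf(x, mean_left, std_left)
    right = normal_pdf(x, mean_right, std_right)
    return right / (left + right)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def squash_and_shift(mean_left, mean_right, std):
    """Squashing value a and shifting value b for the equal-variance case."""
    a = (mean_right - mean_left) / std**2
    b = (mean_left**2 - mean_right**2) / (2.0 * std**2)
    return a, b

mean_left, mean_right, std = 0.0, 2.0, 1.5  # illustrative values
a, b = squash_and_shift(mean_left, mean_right, std)

# Equal variances: the density ratio matches a squashed and shifted sigmoid.
for x in (-3.0, 0.0, 1.0, 2.5, 6.0):
    ratio = posterior_right(x, mean_left, std, mean_right, std)
    assert math.isclose(ratio, sigmoid(a * x + b), rel_tol=1e-9)
    print(f"x={x:5.1f}  ratio={ratio:.6f}")

# Where the right curve is twice as tall as the left, the chance is two-thirds.
x_twice = (math.log(2.0) - b) / a
print("two-thirds point:", posterior_right(x_twice, mean_left, std, mean_right, std))

# Unequal variances: the wider distribution wins in both tails,
# so the curve is no longer a sigmoid.
for x in (-10.0, 0.0, 10.0):
    print(f"x={x:6.1f}  wider right branch: {posterior_right(x, 0.0, 1.0, 2.0, 2.0):.6f}")
```

With the wider right-hand distribution, the ratio rises towards 1 at both extremes rather than going to 0 on the far left, which is the tail-dominance effect mentioned above.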