You got this, believe in yourself. Kevin, easy, easy! Almost there.

Let's talk about optimizers. Optimizers define how neural networks learn: they find the values of parameters such that a loss function is at its lowest. Keep in mind that these optimizers don't know the terrain of the loss, so they need to find the bottom of a canyon while blindfolded.

Let's start with the one, the only: gradient descent. Hop, hippity hop, hop! Wait, too far. Huh, too far again. Oh, come on. The original optimizer, gradient descent, involves taking small steps iteratively until we reach the correct weights, theta. The problem here is that the weights are only updated once after seeing the entire dataset, so this gradient is typically large. Theta can only make large jumps, and it may just hover over its optimal value without actually being able to reach it.

The solution to this: update the parameters more frequently, as in stochastic gradient descent. Stochastic gradient descent updates the weights after seeing each data point instead of the entire dataset. But there's a problem here, too. You see... wait, that example was weird. No, okay, easy! Wait, wait, easy! Nope, nope, no, no, no. Oh, hell no! Stochastic gradient descent can make very noisy jumps that move away from the optimal values, because it's influenced by every single sample. Because of this, we use mini-batch gradient descent as a compromise, updating the parameters only after a few samples.

Huh? Another way to decrease the noise of stochastic gradient descent is to add the concept of momentum. The parameters of a model may have a tendency to change in one direction, typically if examples follow a similar pattern. With momentum, the model can learn faster by paying little attention to the few examples that throw it off from time to time. But you might see a problem here. Bigger, bigger, bigger...
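(If you want to see these updates as code, here's a minimal numpy sketch of batch gradient descent versus mini-batch gradient descent with momentum. The toy linear-regression dataset, learning rate, and batch size are all made up for illustration; they're not from the video.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: fit y = theta * x, loss = mean((theta*x - y)^2)
x = rng.normal(size=200)
true_theta = 3.0
y = true_theta * x + rng.normal(scale=0.1, size=200)

def grad(theta, xb, yb):
    # Gradient of the mean squared error with respect to theta
    return np.mean(2 * (theta * xb - yb) * xb)

lr = 0.1

# Batch gradient descent: ONE update per pass over the entire dataset
theta = 0.0
for _ in range(50):
    theta -= lr * grad(theta, x, y)

# Mini-batch gradient descent with momentum: update after every small batch;
# the velocity term smooths out noisy per-batch gradients
theta_m, velocity, beta = 0.0, 0.0, 0.9
for _ in range(10):
    for i in range(0, len(x), 20):                # batch size 20
        g = grad(theta_m, x[i:i+20], y[i:i+20])
        velocity = beta * velocity + g            # accumulate past gradients
        theta_m -= lr * velocity
```

Both runs end up near the true weight of 3; the mini-batch version simply makes many more, smaller updates per pass.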
Choosing to blindly ignore a sample simply because it isn't typical may be a costly mistake, reflected in our loss. Adding an acceleration term, though, helps. Your model is training, gaining momentum; the weight updates are becoming larger. It finds an odd sample, and because of momentum, it thinks very little of it. But discarding it leads to a loss decrease that wasn't as drastic as you thought. This is where we decelerate our weight updates. The weight updates become smaller again, allowing future samples to fine-tune the current model. We go big or we go home. Wait, the loss didn't decrease as much as we thought it would. Slow down. Haha, not too shabby.

But so far, the learning rate is fixed for every parameter. AdaGrad allows an adaptive learning rate for every parameter. I'm on academo.org, a cool site to plot out equations, looking at a 3D surface plot. This is a plot of z = x² − y². z is the value of the loss, and this loss has a minimum as y tends towards negative or positive infinity. If I were to start somewhere up here on the saddle point, my optimizer would go down in one direction of the y-axis, like how my cursor is moving. With an adaptive learning rate, I have more degrees of freedom: I can increase my learning rate in the y direction and decrease it along the x direction. In fact, this is what we see here. Adaptive learning rate optimizers are able to learn more along one direction than another; hence, they can traverse this kind of terrain.

In the AdaGrad update, the capital G term, G_t,ii, is the sum of squares of the gradients with respect to the theta_i parameter up until that point. The problem with this is that the G term is monotonically increasing over iterations, so the learning rate will decay to a point where the parameter no longer updates, and there's no learning. We can actually see this effect here.
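(As code, the AdaGrad update on that same saddle loss z = x² − y² might look like the sketch below. The starting point, learning rate, and iteration count are made-up illustration values. Note both effects: different effective step sizes per parameter, and the ever-growing G term shrinking the learning rate.)

```python
import numpy as np

def loss_grad(theta):
    # Gradient of z = x^2 - y^2
    x, y = theta
    return np.array([2 * x, -2 * y])

theta = np.array([1.0, 1e-3])   # start just off the saddle point
G = np.zeros(2)                 # running sum of squared gradients, G_t,ii
lr, eps = 0.5, 1e-8

for _ in range(100):
    g = loss_grad(theta)
    G += g ** 2                               # monotonically increasing
    theta -= lr * g / (np.sqrt(G) + eps)      # per-parameter step size

# x shrinks toward 0 while y escapes down the saddle: a different
# effective learning rate along each direction. But because G only
# ever grows, the step size lr/sqrt(G) decays toward zero over time.
```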
For the AdaGrad point, as the iterations go on, it learns slower and slower, even though the optimal trajectory is quite clear. AdaDelta to the rescue. It reduces the influence of past squared gradients by introducing a gamma weight on all of those gradients. This reduces their effect by an exponential factor, so the denominator doesn't explode, and this prevents the learning rate from tanking to zero.

Cool, so we actually have learning rate updates for every single parameter. Well, if this is the case, why not go even further and have momentum updates for every parameter? This is what Adam does. The only change you need to make from AdaDelta to Adam is to add the expected value of past gradients. What does that mean? It means that we are slow initially but pick up speed over time, which is intuitively similar to momentum, as you build up momentum over time. In this way, Adam can take different size steps for different parameters, and with momentum for every parameter, it can also lead to faster convergence because of its speed and accuracy. I think you can see why Adam is used as a de facto optimizer for many projects.

Of course, we can go even further, introducing acceleration in Adam: NAdam. And I could go on. It might seem like a ton of optimizers are out there, and there are, but we've literally just added a term to each algorithm, gradually making them capable of more things. But with all of these optimizers, which is the best one? Well, that depends on the kind of problem you're trying to solve: instance segmentation, sentiment analysis, machine translation, image generation... there are so many problems out there with different types of losses. The best algorithm is the one that can traverse the loss for that problem well; it's more empirical than mathematical. I hope this video helps you better understand the role of these optimizers and clears some things up too.
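(Here's a minimal Adam sketch tying those two ideas together: a running mean of squared gradients for the per-parameter learning rate, plus a running mean of the gradients themselves as the momentum-like term. The toy quadratic loss and hyperparameters are made up for illustration, though the beta values are the commonly used defaults.)

```python
import numpy as np

def loss_grad(theta):
    # Gradient of a toy quadratic loss with minimum at (3, -1)
    return 2 * (theta - np.array([3.0, -1.0]))

theta = np.zeros(2)
m = np.zeros(2)   # running mean of gradients (the momentum-like term)
v = np.zeros(2)   # running mean of squared gradients (adaptive lr term)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = loss_grad(theta)
    m = beta1 * m + (1 - beta1) * g          # expected value of past gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # expected value of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction: small at first,
    v_hat = v / (1 - beta2 ** t)             # so updates pick up speed over time
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

Each parameter gets its own effective step size from v, while m carries momentum, which is exactly the combination described above.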
If you liked the video, hit that like button, click subscribe, and also watch some of my other videos on the channel. You won't regret it. Take care!