Transcript:

Momentum algorithm or gradient descent with momentum is more common than general gradient descent. Almost always runs faster. The basic idea is to calculate the exponential weighted average of the slope. Update the weight with that value. In this video, let’s unravel a sentence and see how it can be implemented. Let’s say you’re optimizing the cost function for most examples. The contour line is:. The red dot indicates the location of the minimum. The gradient descent starts here. If you take one iteration of the gradient descent or the mini-batch gradient descent, you’re heading this way. If we take one step of gradient descent on the other side of the ellipse, It comes like this. Every time you keep going step by step, it goes like this. If you take many steps, it will vibrate slowly as it moves to the minimum. These oscillations up and down, slow down the gradient descent. Prevents using a larger learning rate Because you can overshoot and emit something like this. Therefore, the learning rate should not be too large to prevent the vibration from increasing. Another view of this problem is On the vertical axis. We want the learning to happen more slowly to prevent vibration, but I want to learn faster on the horizontal axis. Because we want to handle the left-to-right shift towards the minimum, Therefore, in the gradient descent method using momentum, the implementation is as follows:. In each iteration, more precisely in iteration T, We will calculate the common derivatives, DW and DB. We will calculate the common derivatives, DW and DB. I will omit the parentheses as superscript’s. However, it will calculate the DW and DB for the current mini-batch. When using batch gradient descent, the current mini-batch is the same as the entire batch. Even if the current mini-batch is the same as the full training set, it will work fine. The next thing to do is V_dw. Compute βV_dw + (1-β)dw. This was calculated before Similar to V_θ = Β*v_θ + (1-β)θ_t. Calculate the moving average as the derivative of w. V_db in a similar way. Compute Βv_db + (1-β)db. So we use w to update the weights. W is updated to W minus. The learning rate Times V_dw W is updated to W minus the learning rate Times V_dw B, similarly b-updated to α*v_db b-updated to Α*v_db. This makes the gradient descent. Step smoother. Some of the derivatives we calculated last time. If you say something like this. If we get the average of this slope, Vertical vibrations are averaged close to zero. In the vertical direction, you want to slow down. Since we average the positive and negative numbers, the average is zero. On the other hand in the horizontal direction, all derivatives point to the right. The average in the horizontal direction is quite large. So in this algorithm, with a few iterations Gradient descent in the end, There is a much smaller vibration in the vertical direction. You will find it moves faster in the horizontal direction. So this algorithm allows you to go a more straight path or reduce vibration. The intuition you can get from this momentum. If you try to minimize the bowl-shaped function If you try to minimize the bowl-shaped function, Then the terms of this derivative Can be seen as providing acceleration when going down And these momentum terms can be seen as representing speed. So when the small ball goes down the slope of this bowl, the derivative gives it acceleration and So when the small ball goes down the slope of this bowl, the derivative gives it acceleration and Make it go down faster. Because it gives you acceleration. And since the value of β is a little less than 1 Provides friction to prevent the ball from speeding up without limit So instead of taking all previous steps independently, It can give acceleration and momentum to the ball Going down the bowl. The analogy of a ball rolling a bowl is. I think it works well for people who like physical intuition. But if the analogy of the ball going down, the bowl is difficult to understand. Don’’t worry. Now let’s take a look at the details of how to implement it. There are two hyperparameters here that control the learning rate Α and the exponential weighted average. The most common value for Β is 0.9 Averages the temperatures over the last 10 days. In fact, it works very well if Β is 0.9 Explore hyperparameters while trying different values, But 0.9 gives a pretty solid value. What about bias correction? Divide v_dw by (1-β^t). Many people don’’t use it well. The reason is that after 10 repetitions. This is because the moving average has been sufficiently advanced and the bias estimation no longer occurs. Therefore, few people do deflection correction when implementing gradient descent or momentum. The process of initializing v_dw to 0 is. It is a matrix of zeros with the same dimension as DW is the same dimension as W is the same dimension as W. V_db is also initialized to a vector of 0 is the same dimension as DB and B. If you read the paper on gradient descent with momentum, You will find that the term for (1-β) is often deleted therefore. So V_dw = Β*v_dw + dw? So V_dw = Β*v_dw + dw? So V_dw = Β*v_dw + dw? The effect of the purple version is v_dw is scaled by a factor of 1/(1-β). So when performing gradient descent update, α needs to be changed to the value, corresponding to 1/(1-β). In practice, both will work fine Will only affect the most optimal value for the learning rate α. But I think this particular expression is less intuitive. One effect of this is by correcting the value of the hyperparameter β. It will affect the scaling of V_dw and V_db and the learning rate will need to be recalibrated. So I personally prefer the formula. I wrote on the left. The 1-β term is a living formula. However, setting β to 0.9 is a common hyperparameter choice for both settings. The difference between these two versions is that the learning rate. Α is calibrated differently. So it was gradient descent with momentum, Almost always works better than gradient descent without momentum. However, there remains another way to speed up the learning algorithm. Let’’s see it together in the next video.