Transcript:

Over the history of deep learning, researchers, including some very well-known ones, have occasionally proposed optimization algorithms and shown that they work well on a few problems. However, those algorithms were then generally shown not to work well across the wide range of neural networks you might want to train. As a result, the deep learning community has developed some skepticism about new optimization algorithms. Because gradient descent with momentum works very well, it is difficult to propose an algorithm that works better. RMSprop and the Adam optimization algorithm, which we cover today, are among the few that have stood up and been shown to work well across a wide range of deep learning architectures. So this is an algorithm you should not hesitate to try, because a lot of people have tried it and seen it work well on a lot of problems. The Adam optimization algorithm essentially combines RMSprop and momentum. Let's see how it works.

To implement Adam, we initialize v_dw and s_dw to 0, and v_db and s_db to 0 as well. Then, on iteration t, we compute the derivatives dW and db using the current mini-batch. We usually do this with mini-batch gradient descent. Next we compute the momentum exponentially weighted average: v_dw = β_1*v_dw + (1-β_1)*dW. I'll call the hyperparameter β_1 to distinguish it from the RMSprop hyperparameter. This is exactly the expression used to implement momentum. The only difference is that the hyperparameter is called β_1 instead of β. Similarly, v_db = β_1*v_db + (1-β_1)*db. The RMSprop-style update is s_dw = β_2*s_dw + (1-β_2)*dW^2, this time using the hyperparameter β_2. The square here is the element-wise square of dW. Likewise, s_db = β_2*s_db + (1-β_2)*db^2. So the first pair is the momentum update using hyperparameter β_1, and the second pair is the RMSprop update using hyperparameter β_2.

In a typical Adam implementation, we also do bias correction. v_dw^corrected, where "corrected" means bias-corrected, equals v_dw/(1-β_1^t). Similarly, v_db^corrected equals v_db/(1-β_1^t). The bias correction for s is analogous: s_dw^corrected equals s_dw/(1-β_2^t), and s_db^corrected equals s_db/(1-β_2^t).

Finally, we perform the update. W becomes W minus α times v_dw^corrected, which is the momentum part, divided by the square root of s_dw^corrected plus ε, which is the RMSprop part: W = W - α*v_dw^corrected/(sqrt(s_dw^corrected)+ε). b is updated in the same way: b = b - α*v_db^corrected/(sqrt(s_db^corrected)+ε). So this algorithm combines the effect of gradient descent with momentum and the effect of gradient descent with RMSprop. It is a commonly used learning algorithm that has proven to work well for neural networks with a very wide range of architectures.

This algorithm has a number of hyperparameters. The learning rate α is still very important and usually needs to be tuned, so you have to try a range of values to find one that works well. A common default for β_1 is 0.9. This is the weighted average of dW, the momentum term. For β_2, the authors of the Adam paper recommend 0.999. This governs the moving weighted average of dW^2 and db^2. And then there is ε. Its value doesn't matter much. The authors of the Adam paper recommend 10^(-8), but you don't really need to tune it, and it has little effect on overall performance. When implementing Adam, people usually just use the default values for β_1 and β_2, and for ε as well. I've never seen anyone tune ε. Usually you try multiple values of α to find the one that works best. β_1 and β_2 can also be tuned, but that is not done often.
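
The update described in this video can be sketched in NumPy roughly as follows. This is a minimal illustration, not a reference implementation: the function names `init_adam_state` and `adam_step`, the `state` dictionary, and the toy gradients in the usage lines are assumptions made for the example. It also assumes ε is added outside the square root, which is one common convention.

```python
import numpy as np


def init_adam_state(W, b):
    """Initialize v_dw, s_dw, v_db and s_db to zeros (hypothetical helper)."""
    return {
        "v_dw": np.zeros_like(W), "s_dw": np.zeros_like(W),
        "v_db": np.zeros_like(b), "s_db": np.zeros_like(b),
    }


def adam_step(W, b, dW, db, state, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on iteration t (t starts at 1); returns the new W and b."""
    # Momentum part: exponentially weighted average of the gradients.
    state["v_dw"] = beta1 * state["v_dw"] + (1 - beta1) * dW
    state["v_db"] = beta1 * state["v_db"] + (1 - beta1) * db
    # RMSprop part: exponentially weighted average of the element-wise squares.
    state["s_dw"] = beta2 * state["s_dw"] + (1 - beta2) * dW ** 2
    state["s_db"] = beta2 * state["s_db"] + (1 - beta2) * db ** 2
    # Bias correction, which matters most for small t.
    v_dw_c = state["v_dw"] / (1 - beta1 ** t)
    v_db_c = state["v_db"] / (1 - beta1 ** t)
    s_dw_c = state["s_dw"] / (1 - beta2 ** t)
    s_db_c = state["s_db"] / (1 - beta2 ** t)
    # Parameter update: momentum term divided by the RMSprop term.
    W = W - alpha * v_dw_c / (np.sqrt(s_dw_c) + eps)
    b = b - alpha * v_db_c / (np.sqrt(s_db_c) + eps)
    return W, b


# Toy usage: the gradients here are made up purely for illustration.
W = np.array([[1.0, -2.0]])
b = np.array([0.5])
state = init_adam_state(W, b)
for t in range(1, 4):
    dW, db = W.copy(), b.copy()  # gradient of 0.5*||W||^2 + 0.5*b^2
    W, b = adam_step(W, b, dW, db, state, t, alpha=0.1)
```

Note that α has no default in this sketch, reflecting the point that the learning rate is the hyperparameter that usually needs tuning, while β_1, β_2 and ε are left at the recommended values.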
The name Adam comes from adaptive moment estimation. β_1 computes the mean of the derivatives, which is called the first moment, and β_2 computes the exponentially weighted average of the squares, which is called the second moment. Everyone just calls it the Adam optimization algorithm. Incidentally, I have an old friend and colleague named Adam Coates. Although he uses this algorithm, it has nothing to do with him. I mention it because I get asked that question every now and then. So that was the Adam optimization algorithm. With it, you will be able to train your neural networks faster. Next, let's build more intuition about hyperparameter tuning and about the optimization problem. In the next video, I'm going to talk about learning rate decay.