Transcript:

Dropout is a strange-looking technique: we randomly delete units in the neural network. Why does it work as a regularizer? Let's find out. In the previous video we discussed the first intuition: because units are randomly deleted, on every iteration you are effectively training a smaller neural network, and using a smaller network has a regularizing effect.

Here is a second intuition. Look at it from the perspective of a single unit. This unit has to take its inputs and produce a meaningful output. With dropout, its inputs can be randomly deleted: sometimes these two units are dropped, sometimes other units are. So the unit I've marked in purple cannot rely on any one feature, because any particular input can go away at random. It cannot bet everything on one input, so it is reluctant to put an unusually large weight on any specific input; instead it is better off spreading the weights across each of its four inputs. Spreading the weights out shrinks the squared norm of the weights. So, as we saw with L2 regularization, the effect of implementing dropout is to shrink the weights, which helps prevent overfitting, just like L2 regularization. In fact, dropout has been shown to be an adaptive form of L2 regularization, in which different weights receive different penalties depending on the size of the activation being multiplied by that weight. In summary, dropout can have a similar effect to L2 regularization, but with an L2 penalty that differs across weights and adapts to the scale of the different inputs.

Let me mention one more detail for implementing dropout. Here is a network with three input features and hidden layers of 7, 7, 3, 2, and 1 units. One of the parameters we have to choose is keep_prob, the probability of keeping each unit in a given layer.
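The random-deletion mechanism described above can be sketched in a few lines. This is my own minimal illustration of an inverted-dropout step on a layer's activations (not code from the course): each unit is kept with probability keep_prob, and kept units are scaled by 1/keep_prob so the expected activation is unchanged.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Inverted dropout on an activation matrix `a` (units x examples).

    Each unit is kept with probability `keep_prob`; kept units are
    scaled by 1/keep_prob so the expected value of the output matches
    the input, which is what lets us skip dropout at test time.
    """
    mask = rng.random(a.shape) < keep_prob  # True = keep this unit
    return (a * mask) / keep_prob

rng = np.random.default_rng(0)
a = np.ones((4, 5))                  # 4 units, 5 examples
a_drop = dropout_forward(a, 0.8, rng)
# Each entry is either 0 (dropped) or 1/0.8 = 1.25 (kept and rescaled).
```

Because some inputs to the next layer are zeroed on every iteration, no downstream unit can rely on any single one of them, which is exactly the weight-spreading effect discussed above.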
You can also vary keep_prob from layer to layer. The first layer's weight matrix w^[1] is a (3, 7) matrix, the second weight matrix w^[2] is (7, 7), w^[3] is (7, 3), and so on. w^[2] is the largest weight matrix, since at (7, 7) it has the most parameters. So, to reduce overfitting in that matrix, layer 2 should have a relatively low keep_prob; I'll make it 0.5. On the other hand, for layers with less risk of overfitting you can set a higher keep_prob, say 0.7, and for layers where overfitting is not a concern you can set keep_prob to 1. The numbers I've written in the purple boxes are the keep_prob values for the different layers; a keep_prob of 1.0 means keeping all units, that is, not using dropout in that layer. But for layers with many parameters, where overfitting is a bigger worry, set keep_prob smaller for stronger dropout. This is similar to L2 regularization, where you increase the parameter λ for layers that need more regularization than others.

Theoretically, you can apply dropout to the input layer as well, randomly deleting one or more input features, but it's better not to do this very often. For the input layer, a keep_prob of 1.0 is the most common value, though sometimes 0.9 is used. You wouldn't want to delete half of the input features, so if you do apply dropout to the input layer, use a keep_prob close to 1.

In summary, for layers where you are more concerned about overfitting than others, you can set keep_prob lower than for the other layers. The downside is that you have more hyperparameters to search over with cross-validation. One alternative is to apply dropout to some layers and not to others, so that you have only a single keep_prob hyperparameter for the layers that use dropout. Before we wrap up, here are some implementation tips.
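The per-layer keep_prob idea can be sketched as a small forward pass. This is an illustrative toy network of my own (random weights, hypothetical keep_prob assignments patterned on the lecture's example): the 7-unit layer whose weight matrix is 7x7 gets the strongest dropout (0.5), a lighter 0.7 elsewhere, and 1.0 (no dropout) on the small output layer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Layer sizes from the example: 3 input features, then layers of 7, 7, 3 units.
layer_sizes = [3, 7, 7, 3]
# Hypothetical per-layer keep_prob values: strongest dropout (0.5) on the
# layer with the big 7x7 weight matrix, 1.0 (no dropout) on the small layer.
keep_probs = [0.7, 0.5, 1.0]

weights = [rng.standard_normal((n_out, n_in)) * 0.1
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, train=True):
    a = x
    for w, kp in zip(weights, keep_probs):
        a = np.maximum(0.0, w @ a)           # ReLU activation
        if train and kp < 1.0:               # inverted dropout, train time only
            mask = rng.random(a.shape) < kp
            a = a * mask / kp
    return a

x = rng.standard_normal((3, 10))             # 10 examples
out_train = forward(x, train=True)           # dropout applied per layer
out_test = forward(x, train=False)           # no dropout at test time
```

Note that the `train` flag captures the cross-validation downside mentioned above only indirectly: every entry of `keep_probs` below 1.0 is one more hyperparameter you have to tune.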
Many of the first successful implementations of dropout were in computer vision. Computer vision uses so many pixel values as input that in most cases you don't have enough data, so dropout is used very frequently there; researchers in vision today almost always use it by default. But remember that dropout is a regularization technique that helps prevent overfitting. So unless my network has an overfitting problem, I wouldn't use dropout, even though it is routinely used in some application areas. Computer vision uses it so much only because there is rarely enough data, so overfitting happens almost everywhere; that intuition doesn't always generalize to other fields.

One big downside of dropout is that the cost function J is no longer well defined, since on every iteration we randomly delete a bunch of nodes. So if you double-check the performance of gradient descent, it becomes difficult to verify that a well-defined cost function J falls on every iteration: the cost function you are optimizing is not well defined and is hard to compute, which makes it difficult to debug by plotting J and checking that it looks right. What I usually do is set keep_prob to 1 to turn off the dropout effect, run the code to make sure J is monotonically decreasing, and then turn the dropout effect back on without changing the code, trusting that gradient descent still works with dropout, because with dropout we need methods other than just plotting the cost function and eyeballing it.

In addition to the techniques discussed so far, there are a few more regularization techniques. See you in the next video.
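The debugging workflow above can be sketched concretely. This is my own toy logistic-regression example, not code from the course: with keep_prob = 1.0 dropout is off, the cost J is well defined, and we can check that it decreases monotonically before turning dropout back on.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 100))            # 2 features, 100 examples
y = (X[0] + X[1] > 0).astype(float)          # toy labels
w = np.zeros(2)

def cost_and_grad(w, keep_prob):
    """Cross-entropy cost and gradient, with optional input dropout."""
    x = X
    if keep_prob < 1.0:                      # illustrative dropout on inputs
        mask = rng.random(X.shape) < keep_prob
        x = X * mask / keep_prob
    p = 1.0 / (1.0 + np.exp(-(w @ x)))       # sigmoid predictions
    J = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = (x * (p - y)).mean(axis=1)
    return J, grad

# Step 1: keep_prob = 1.0, so J is well defined and should fall every step.
costs = []
for _ in range(50):
    J, g = cost_and_grad(w, keep_prob=1.0)
    costs.append(J)
    w -= 0.5 * g

monotone = all(a >= b for a, b in zip(costs, costs[1:]))
# Step 2: once this check passes, re-enable dropout (keep_prob < 1.0)
# without touching the rest of the code; J is now noisy, so a plot of it
# is no longer a reliable debugging signal.
```

The point of the sketch is the two-step discipline: validate gradient descent against a well-defined J first, then switch dropout back on unchanged.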