Transcript:

[MUSIC] In this video, we’ll be discussing batch normalization, otherwise known as batch norm and how it applies to training and artificial neural network will then see how to implement batch norm and code with Kerris. Before getting to the details about batch normalization, let’s quickly first discuss regular normalization techniques, generally speaking when training a neural network we want to normalize or standardize our data in some way ahead of time as part of the pre-processing step. This is a step where we prepare our data to get it ready for training normalization and standardization both have the same objective of transforming the data to put all the data points on the same scale. A typical normalization process consists of scaling the numerical data down to be on a scale from zero to one in a typical standardization process consists of subtracting the mean of the data set from each data point and then dividing the difference by the data sets standard deviation. This forces the standardized data to take on a mean of zero and a standard deviation of one in practice. This standardization process is often just referred to as normalization as well in general, though this all boils down to putting our data on some type of known or standard scale. So why do we do this well? If we didn’t normalize our data in some way, you can imagine that we may have some numerical data points in our data set that might be very high and others that might be very low, For example, say we have data on the number of miles individuals have driven a car over the last five years. Then we may have someone who’s driven a hundred thousand miles total, and we may have someone else who’s only driven a thousand miles total. This data has a relatively wide range and isn’t necessarily on the same scale. Additionally, each one of the features for each of our samples could vary widely, as well if we have one feature, which corresponds to an individual’s age, and then another feature corresponding to the number of miles that that individual has driven a car over the last five years. Then again, we see that these two pieces of data age and miles driven will not be on the same scale. The larger data points in these non-normalized datasets can cause instability in neural networks because the relatively large inputs can cascade down through the layers in the network, which may cause imbalance gradients, which may therefore cause the famous exploding gradient problem. We may cover this particular problem in another video, but for now understand that this imbalanced non-normalized data may cause problems with our network that make it drastically harder to Train. Additionally, non-normalized data can significantly decrease our training speed when we normalize our inputs. However, we put all of our data on the same scale and attempts to increase training speed as well as avoid the problem we just discussed because we won’t have this relatively wide range between data points any longer once we’ve normalized the data. Okay, so this is good, but there’s another problem that can arise even with normalized data so from our previous video on how a neural network learns we know how the weights in our model become updated over each epoch during training via the process of stochastic gradient descent or SGD. So what if during training, one of the weights ends up becoming drastically larger than the other weights? Well, this large weight will then cause the output from its corresponding neuron to be extremely large, and this imbalance will again continue to cascade through the neural network causing instability. This is where batch normalization comes into play batch. Norm is applied to layers that you choose to apply it to within your network. When applying batch norm to a layer the first thing the batch norm does is normalize the output from the activation function recall from our video on activation functions that the output from a layer is passed to an activation function, which transforms the output in some way, depending on the function itself before being passed to the next layer as input after normalizing the output from the activation function bash. Norm then multiplies this normalized output by some arbitrary parameter and then adds another arbitrary parameter to this resulting product. This calculation, with the two arbitrary parameters sets a new standard deviation and mean for the data, These four parameters, consisting of the mean the standard deviation and the two arbitrarily set parameters are all trainable, meaning that they, too will become optimized during the training process. This process makes it so that the weights within the network don’t become imbalance with extremely high or low values since the normalization is included in the gradient process. This addition of batch norm to our model can greatly increase the speed in which training occurs and reduce the ability of outlying large weights that will over influence the training process, so when we spoke earlier about normalizing our input data in the pre-processing step before training occurs. We understand that this normalization happens to the data before being passed to the input layer. Now with batch norm, we can normalize the output data from the activation functions for individual layers with our model as well, so we have normalized data coming in, and we also have normalized data within the model itself. Now everything we just mentioned about the batch Normalization process occurs on a per batch basis. Hence the name batch norm. These batches are determined by the batch size. You set when you train your model. So if you’re not yet familiar with training batches or batch size check out my video that covers this topic so now that we have an understanding of batch norm, let’s look at how we can add batch Norm to a model and code using Kerris. So I’m here in my Jupiter notebook, and I’ve just copied the code for a model that we’ve built in a previous video, so we have a model with two hidden layers with 16 and 32 nodes, respectively, both using rel you as their activation functions and then an output layer with to output categories using the softmax activation function. The only difference here is this line between the last hidden layer and the output layer. This is how you specify batch normalization and Cari’s following the layer for which you want. The activation output normalized. You specify a batch normalization layer, which is what we have here to do this. You first need to import batch normalization from Charis as shown in this cell. Now, the only parameter that I’m specifying here is the axis parameter and that’s just to specify the axis for the data that should be normalized, which is typically the features axis. There are several other parameters. You can optionally specify, including two called beta initializer and gamma initializer. These are the initializers for the arbitrarily set parameters that we mentioned when we were describing. How batch Norm works. These are set by default to zero and one by Kerris. But you can optionally change these and set them here, along with several other optionally specified parameters as well and that’s really all There is to it for implementing batch Norm and Cari’s. So I hope, in addition to this implementation that you also now understand what batch Norm is how it works and why it makes sense to apply it to a neural network. And I hope you found this video helpful. If you did please like the video, subscribe, suggest and comment and thanks for watching [Music].