[MUSIC] This video will explain the SqueezeNet architecture. SqueezeNet is a network that was designed to be small in storage size, resulting in a model with fifty times fewer parameters than AlexNet and a model size of less than half a megabyte. The motivation for smaller convolutional neural networks is that they require less communication across servers during distributed training, less bandwidth to export new models, and they're more feasible to deploy on hardware with limited memory. Distributed training is the most popular way to train deep neural networks, especially when you're training on enormous datasets or enormous models, though there's an argument for using distributed training even with small models. The way distributed training works, also commonly referred to as distributed synchronous or asynchronous stochastic gradient descent, is that there is a server that holds the model parameters; it distributes those parameters to worker machines, each worker runs a stochastic gradient descent batch update, and then it sends its update back to the parameter server. In this way, the learning is distributed between the different machines. Another motivation is self-driving cars: a company like Tesla would want to update their computer vision models and then send the new models to the cars, and if you have a smaller model, it's a quicker communication from server to car. A similar idea applies to embedded systems, where devices like field-programmable gate arrays have less than 10 megabytes of on-chip memory, so you can't fit a 240-megabyte AlexNet model onto these kinds of devices. These devices aren't used for training deep neural networks; they're used for inference, for making predictions, so you don't need to train on the device, just store the model there for applications like a smart camera in an IoT system.
So SqueezeNet achieves AlexNet-level ImageNet accuracy with only 0.5 megabytes, and this is done through three high-level design strategies. First, they replace 3x3 filters with 1x1 filters, because a 1x1 filter has nine times fewer parameters. Second, they decrease the number of input channels to the 3x3 filters. Third, they downsample late in the network; their reasoning is that convolutional networks retain more semantic information when the activation maps are spatially larger, so they want the height and width of each feature map to stay large early in the network. They implement these strategies explicitly with the fire module, and this is the key idea, the key layer introduced in the SqueezeNet paper. What they do is squeeze the features with a squeeze layer consisting of 1x1 convolutional filters, and then expand them with a combination of 1x1 and 3x3 filters, so the feature map is small after the squeeze and bigger after the expand. The fire module is defined by three parameters: the number of convolutional filters in the squeeze layer (s1x1), the number of 1x1 filters in the expand layer (e1x1), and the number of 3x3 filters in the expand layer (e3x3). In the diagram here, there are three 1x1 convolutions in the squeeze layer and then four 1x1s and four 3x3s in the expand layer, just so you can connect the parameters with the picture shown. One additional constraint is that the number of squeeze filters is kept much smaller than the number of expand filters. Another thing they have to do is zero-pad the input to the 3x3 filters so that the 3x3 outputs have the same height and width as the 1x1 outputs.
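To see why the squeeze layer saves parameters, here is a minimal sketch that counts the weights in a fire module. The function name `fire_module_params` is my own; the values s1x1=16, e1x1=64, e3x3=64 with a 96-channel input correspond to the first fire module in the paper.

```python
def fire_module_params(in_channels, s1x1, e1x1, e3x3):
    """Count the weights in a fire module (biases ignored).

    s1x1: number of 1x1 filters in the squeeze layer
    e1x1, e3x3: numbers of 1x1 and 3x3 filters in the expand layer
    """
    squeeze = in_channels * s1x1 * 1 * 1   # 1x1 squeeze convolutions
    expand_1x1 = s1x1 * e1x1 * 1 * 1       # 1x1 expand convolutions
    expand_3x3 = s1x1 * e3x3 * 3 * 3       # 3x3 expand convolutions
    return squeeze + expand_1x1 + expand_3x3

# Fire module with a 96-channel input, producing 64 + 64 = 128 output channels
with_squeeze = fire_module_params(96, 16, 64, 64)
# A plain 3x3 layer mapping 96 -> 128 channels for comparison
plain_3x3 = 96 * 128 * 3 * 3
print(with_squeeze, plain_3x3)  # 11776 vs 110592
```

The squeeze layer cuts the channel count feeding the 3x3 filters from 96 down to 16, which is exactly design strategy two, and roughly a 10x parameter reduction for this layer.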
Because if you remember, with a convolution, if you have a 32 by 32 feature map and you do a 3 by 3 convolution over it, you're going to get a 30 by 30 output feature map. So they pad the border with zeros such that the 3x3 output has the same height and width resolution as the 1x1 output, and this way they can just concatenate the outputs from the expand layers along the channel axis, the feature-map axis. The other idea they use is delayed downsampling, and this is done with max pooling after the first convolution and after the fourth and eighth fire modules, with pooling again after the final convolution; it'll be clear when we see the full architecture diagram. So this is what the full architecture looks like: you take an input image, convolve over it, max pool, and then pass through these fire modules. The version shown in the middle and the one shown on the far right add ResNet-style skip connections to SqueezeNet. Then this chart shows the SqueezeNet parameters and widths: the different s1x1, e1x1, and e3x3 values for each fire module, and on the far right you see the pruning. What pruning does is mask out weights whose magnitude falls below a certain threshold, so a weight like 0.02 would probably be masked out, compared to a weight like 2.8, which wouldn't be. On the far left side, you can see the output size of the feature maps throughout the processing of SqueezeNet. So this result, this is the most interesting result shown in the paper.
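The padding arithmetic and the channel-wise concatenation described above can be sketched with the standard convolution output-size formula. The helper name `conv_out` and the toy shapes are illustrative, not from the paper.

```python
import numpy as np

# Output height/width of a convolution: floor((H + 2*pad - k) / stride) + 1
def conv_out(h, k, pad=0, stride=1):
    return (h + 2 * pad - k) // stride + 1

print(conv_out(32, 3, pad=0))  # 30: a 3x3 conv shrinks the map without padding
print(conv_out(32, 3, pad=1))  # 32: zero-padding by 1 keeps the size
print(conv_out(32, 1, pad=0))  # 32: a 1x1 conv never changes it

# With matching height/width, the expand outputs concatenate on the channel axis
e1x1 = np.zeros((4, 32, 32))   # four 1x1 expand feature maps (toy values)
e3x3 = np.zeros((4, 32, 32))   # four 3x3 expand feature maps, zero-padded
out = np.concatenate([e1x1, e3x3], axis=0)
print(out.shape)               # (8, 32, 32)
```

Without the zero padding, the 3x3 branch would produce 30x30 maps and the concatenation with the 32x32 maps from the 1x1 branch would fail.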
This is a table showing how SqueezeNet has a very small model size compared to previously existing methods. The top four techniques start from AlexNet and compress it with either singular value decomposition, network pruning, or the prior state of the art, deep compression. What they also find is that when they take SqueezeNet and apply the deep compression technique, which is a combination of pruning, quantization with a codebook, and Huffman encoding, they're able to reduce their model all the way down to 0.47 megabytes using six-bit weights, and they don't lose any ImageNet accuracy by doing this. So this is a really amazing result, and it's the smallest model for ImageNet classification out there. Next are the parameters that define SqueezeNet's fire modules, like the squeeze ratio, the ratio of s1x1 to e1x1 plus e3x3, and how that ratio evolves from the first fire module all the way to the last fire module in the network. What they show in this plot is that when the squeeze ratio reaches 0.75, the accuracy begins to saturate, meaning you wouldn't get a better result with 0.8, 0.9, or 1.0, and it also shows that a higher squeeze ratio costs more megabytes in the model. On the right is the percentage of 3x3 filters compared to 1x1 filters, that is, the e1x1-to-e3x3 ratio in the expand component of the fire module. Then for the macroarchitecture parameters, they explore the use of ResNet skip connections: they have the simple bypass, which is what you use when the feature maps match each other in dimensions, and they also show using a 1x1 convolution to add more skip connections where the dimensions of the feature maps don't match each other.
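The two quantities discussed here, magnitude pruning and the squeeze ratio, can be sketched in a few lines. The threshold 0.1, the function names, and the toy weight values are my own illustrations; the 0.125 baseline squeeze ratio follows from the fire-module parameters in the paper.

```python
import numpy as np

# Magnitude pruning: zero out weights whose absolute value falls below
# a threshold (0.1 here is an arbitrary example value)
def prune(weights, threshold=0.1):
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.array([0.02, -2.8, 0.05, 1.3, -0.004])
pruned, mask = prune(w)
print(pruned)       # small-magnitude weights like 0.02 are zeroed, -2.8 survives
print(mask.mean())  # fraction of weights kept: 0.4

# Squeeze ratio SR = s1x1 / (e1x1 + e3x3); accuracy saturates around SR = 0.75
def squeeze_ratio(s1x1, e1x1, e3x3):
    return s1x1 / (e1x1 + e3x3)

print(squeeze_ratio(16, 64, 64))  # 0.125 for the baseline fire-module sizing
```

In deep compression, the surviving weights would then be clustered into a codebook and Huffman-encoded, which is how the model gets down to six-bit weights.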
So you can't just add them, because the dimensionality of the tensors doesn't match. Thanks for watching this video on SqueezeNet. I think after watching this, you'll be really interested in another video on Henry AI Labs about deep compression, which is the compression technique that was the previous state of the art and can be combined with the manual SqueezeNet design to achieve the half-megabyte, AlexNet-level ImageNet accuracy. So thanks for watching, and please subscribe to Henry AI Labs for more deep learning videos.