Transcript:

In this video, we will look at GoogLeNet, the winner of the 2014 ImageNet challenge. The basic building block of GoogLeNet is the inception module, so let us look at that first. For instance, in the VGG network that we saw in the previous video, the basic building block was a block of convolutional layers with very small filter sizes. AlexNet, on the other hand, had variable filter sizes: the first layer was 11 x 11, then we had 5 x 5 filters, then 3 x 3, and then the fully connected layers. The inception module incorporates both of these concepts, in the sense that every layer has all possible filter sizes. So they build a convolution block which has multiple filter sizes, and we let backprop, the learning, decide which weights to update based on the objective function.

So let us quickly look at this inception module. There are two images here: this one is what the authors of the GoogLeNet paper called the naive implementation, and this one is the actual implementation that they followed. So what is the difference? We have an input feature map. The thickness shown here, in a different color, is the depth of the feature map, so let us say there are some K feature maps. We have 1 x 1, 3 x 3 and 5 x 5 convolutions, and also 3 x 3 max pooling, applied to these feature maps. The outputs of each of them are taken and concatenated along the depth. This means that we have to do the convolutions in such a way that we get the same sized feature map from each of them.
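
As a rough illustration of the size matching just described, the sketch below computes the zero padding each branch would need so that its output keeps the height and width of the input. It assumes stride 1 and odd kernel sizes, and it uses 28 as an example input size. The function names `same_padding` and `conv_output_size` are made up for this example and do not come from the paper.

```python
def same_padding(kernel_size):
    """Zero padding per side that keeps the spatial size at stride 1 (odd kernels only)."""
    if kernel_size % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    return (kernel_size - 1) // 2


def conv_output_size(input_size, kernel_size, padding, stride=1):
    """Standard output-size formula for a convolution or pooling window."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Branches of the naive module: 1 x 1, 3 x 3, 5 x 5 convolutions and 3 x 3 max pooling.
# The input size of 28 is only an example value.
branches = [("1x1 conv", 1), ("3x3 conv", 3), ("5x5 conv", 5), ("3x3 max pool", 3)]
for name, k in branches:
    p = same_padding(k)
    print(name, "padding", p, "output size", conv_output_size(28, k, p))
```

Every branch ends up with the same height and width, which is what allows the outputs to be stacked along the depth.
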
So they are all zero padded accordingly to get the correct size, so that the outputs can all be concatenated. As you will see, doing it this way is a problem because the number of computations becomes huge. So this is the naive implementation. Implementing the inception module, as they call it, this way has an issue in terms of the number of computations involved: it becomes huge because we have large filter sizes such as 5 x 5. I think it is possible to use 7 x 7 also, but the authors of the GoogLeNet inception module stopped at 5 x 5, which probably worked very well for the ImageNet challenge.

So what is the better implementation? The better implementation is to use 1 x 1 convolutions to reduce the size of the feature maps. When I say size in this context, I mean the depth of the feature map. So let us say the input layer has size 28 x 28 x 192, that is, 192 feature maps, as I have written down here. We would use 1 x 1 convolutions to project this number of feature maps to a smaller number, that is, to a smaller volume. What do I mean by that? If you recall, 1 x 1 convolutions preserve the size of the feature maps in the plane, so the XY size is preserved, but we can use them to reduce the depth of the feature map. That is precisely what the inception module accomplishes. So if you have a very large number of feature maps in the input volume, you would use 1 x 1 convolutions to bring it down.
Let us say we take the 1 x 1 convolution before the 5 x 5. We can use the 1 x 1 convolution to reduce the size of the feature maps to, let us say, 28 x 28 x 16, and then from the 5 x 5 convolution we can output, let us say, 28 x 28 x 128 feature maps. Just to recall, this is done by defining 16 1 x 1 convolutions; of course, each 1 x 1 convolution acts across the volume, that is, across all the feature maps. You can also question whether this would lead to loss of information. The number of feature maps here seems to be a parameter that they have tweaked; this has to be done by some kind of cross-validation to see which value is best. So these 1 x 1 convolutions come prior to the larger convolutions, larger in the sense of convolutions with larger filter sizes. The 1 x 1 convolutions project the input volume to a smaller dimension, and subsequently you use the 3 x 3 convolutions to bring it back up. This is what is now referred to as a bottleneck layer: you shrink the depth of the feature maps that way. It works similarly for the max pooling also. Max pooling preserves the depth of the feature maps, so you can again use a 1 x 1 convolution to reduce it.

So these were the two big innovations in the inception module. If you saw in VGG, they divide the network into blocks, and each block had a succession of convolutions. The inception module instead incorporates multiple convolution kernel sizes, just like in AlexNet.
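
To see why a 1 x 1 convolution changes only the depth, here is a minimal pure-Python sketch. It treats the feature map as nested lists of shape height x width x depth and each 1 x 1 filter as one weight per input channel. Biases and the nonlinearity are left out, and `conv1x1` is a name chosen for this example only.

```python
def conv1x1(feature_map, filters):
    """Apply 1 x 1 filters to a feature map.

    feature_map: nested lists, shape [height][width][in_depth]
    filters:     nested lists, shape [out_depth][in_depth]
    returns:     nested lists, shape [height][width][out_depth]
    """
    return [
        [
            # Each output channel is a weighted sum over the input channels of one pixel.
            [sum(w * v for w, v in zip(filt, pixel)) for filt in filters]
            for pixel in row
        ]
        for row in feature_map
    ]


# Tiny example: a 2 x 2 map with depth 3, projected down to depth 2.
x = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [0, 1, 0]]]
w = [[1, 0, 0],   # filter 0 copies channel 0
     [1, 1, 1]]   # filter 1 sums all channels
y = conv1x1(x, w)
print(y)
```

The height and width of `y` match those of `x`; only the depth goes from 3 to 2. With 16 filters applied to a 28 x 28 x 192 input, the same operation would give 28 x 28 x 16.
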
In AlexNet we saw that we used 11 x 11, 5 x 5 and 3 x 3, but in this case it is 5 x 5, 3 x 3, 1 x 1 and a max pooling layer all in one module. You also use these bottleneck 1 x 1 convolutions to project the input volume to a lower dimension, in terms of the depth of the volume, before doing the 3 x 3 and the other convolutions with larger kernel sizes.

So let us just look at the network. This had about 22 layers with weights. The input, as usual, was 224 x 224 x 3. There was an initial sequence of convolutions and max pooling, which reduced the size of the input to 28 x 28 x 192, and that was used as input to a sequence of inception layers, followed by max pooling, again a sequence of inception layers, followed by max pooling, and so on, till we have the typical output with 1000 activations. They have labeled the inception modules as 3a, 3b, 4a, 4b and so on. I urge you to go read the paper, where you will see these labels, and there is a corresponding table which tells you how the computations are done in each particular module. So we will walk through one inception module, 3a, and see the saving in the number of computations from using a 1 x 1 bottleneck to reduce the number of feature maps. So this is a 22 layer network, but it had very few parameters, about 5 million, and it won the 2014 ImageNet challenge with a top-5 error rate of about 7 percent. It is slightly better than VGG, but with a much smaller number of parameters.
So this is considered one of the networks that is very small in terms of the number of parameters, in contrast to, let us say, AlexNet and VGG. VGG had around 138 million and AlexNet had 60 million parameters, but this only had 5 million parameters, so it is a lightweight network.

So we will look at this inception 3a right here, which takes as input 28 x 28 x 192, and the output depth is, I think, 256. We have reproduced a piece of the table here; I urge you to go back and look at the table in the paper. This is the output size of the max pooling layer, which is the input to inception module 3a, and this is the output of inception module 3a. It has 64 1 x 1 convolutions. Then it has 96 under "#3x3 reduce", which in that table means the number of 1 x 1 convolutions done before the 3 x 3: the 96 feature maps are produced by the 1 x 1 convolutions prior to doing the 3 x 3 stage. The 3 x 3 then produces 128 feature maps. Again, 16 feature maps are produced by the 1 x 1 convolutions before the 5 x 5, followed by 32 from the 5 x 5, and then 32 from the 1 x 1 convolution after the max pooling layer. So what it means is that prior to the 3 x 3 we would have 96 feature maps, prior to the 5 x 5 we will have 16 feature maps, and the 1 x 1 after the max pooling would give you 32 feature maps. As for the output of the 3 x 3 convolutions, let me go back and look; I do not want to keep going back and forth.
So the 3 x 3 produces 128 and the 5 x 5 produces 32. The pool projection layer produces 32; there is a max pooling layer and then the 1 x 1 convolution. The 1 x 1 convolution by itself produces 64. For the 3 x 3, of course, there are 96 coming in to it, if I am not mistaken. Yes, the reduce is 96 there, and the reduce is 16 for the 5 x 5. So this is the 1 x 1 convolution, this is the 3 x 3, this is the 5 x 5, and this is the pool projection. For the pooling there is nothing before it; there is a 1 x 1 convolution after it. The plain 1 x 1 takes its input directly from the input volume, and the 3 x 3 and 5 x 5 take theirs from the reduce layers. So "#3x3 reduce" is the number of feature maps produced by the 1 x 1 convolution. Recall, and I will go back here once again, that the 1 x 1 convolutions are done prior to the 3 x 3 and the 5 x 5, and the 1 x 1 convolution follows the max pooling, with nothing before it. Then there is one plain 1 x 1 convolution layer. So that is what we have here: 64 feature maps from the plain 1 x 1, 96 from the 1 x 1 prior to the 3 x 3, then the output from the 3 x 3 is 128, and the output from the 5 x 5 is 32, but prior to that you reduce the dimension to 16.

So what is the saving in terms of the number of computations? The output of the max pool, which is the input to the inception module, is 28 x 28 x 192. Now, if you directly do a 3 x 3 convolution on these 192 feature maps, the number of parameters in every filter is 3 x 3 x 192, and if we produce, let us say, 128 feature maps, that is 28 x 28 x 128, then the number of operations would be about 173 million. Now let us instead do the 1 x 1 projection into a smaller volume first. If we do that, then we produce 96 feature maps following the 1 x 1 convolution.
And then we follow it with the 3 x 3 convolution to again produce 128 feature maps. So we have to do this calculation here for the 1 x 1 convolution as well. It is very easy to write down, so I will write it down. If you do 3 x 3 convolutions directly to produce 128 feature maps of size 28 x 28, the number of elements in the output feature maps, or the number of output activations, is 28 x 28 x 128. For each output activation, how many computations do we have to do? 3 x 3 x 192 products. That is pretty much what you see there, and it comes to about 173 million. You can repeat the calculation for the bottleneck version too. The output of the 1 x 1 convolution is 28 x 28 x 96; this is the total number of activations produced. For each activation, we have to perform 1 x 1 x 192 multiplications, so that is what we have here. Similarly, for the 3 x 3 convolutions we produce 28 x 28 x 128 activations, and for each one of these we have to perform 3 x 3 x 96 multiplications. So that number is here. When we add them, you see that there is a reduction in the number of computations, just for this one particular set of feature maps.

So this was the innovation behind the inception module. It does two things. It lets the network decide which feature maps are more relevant, based on backpropagation, that is, the optimization. At the same time, as we saw earlier with VGG net, using a larger receptive field means that the number of parameters increases and the number of computations also increases correspondingly. That is solved by using a bottleneck layer, where 1 x 1 convolutions are used to project your feature maps to a smaller dimension.
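
The arithmetic above can be checked with a short sketch. It counts only multiplications, one per weight per output activation, and ignores biases and additions; the helper name `conv_multiplications` is an assumption made for this example.

```python
def conv_multiplications(out_h, out_w, out_depth, kernel, in_depth):
    """One kernel x kernel x in_depth dot product per output activation."""
    return out_h * out_w * out_depth * kernel * kernel * in_depth


# Direct 3 x 3 convolution: 28 x 28 x 192 -> 28 x 28 x 128
direct = conv_multiplications(28, 28, 128, 3, 192)

# Bottleneck: 1 x 1 down to 96 feature maps, then 3 x 3 up to 128 feature maps
reduce_step = conv_multiplications(28, 28, 96, 1, 192)
conv_step = conv_multiplications(28, 28, 128, 3, 96)
bottleneck = reduce_step + conv_step

print(f"direct:     {direct:,}")      # 173,408,256
print(f"bottleneck: {bottleneck:,}")  # 101,154,816
```

Under these assumptions the bottleneck version needs roughly 101 million multiplications against roughly 173 million for the direct 3 x 3 convolution.
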
Here the reduction is along the depth of the feature map, so the volume becomes smaller. So we will look at the inception 3a layer. The table is reproduced from the paper; it tells you the number of 1 x 1, 3 x 3 convolutions, et cetera, in every layer. So let us go through it one by one. The input to the inception 3a layer is of size 28 x 28 x 192, and the output has the same spatial size, 28 x 28 x 256. The number of 1 x 1 convolutions is 64, so there are 64 here, which means that this 1 x 1 convolution layer will produce 64 feature maps of the same size. The "#3x3 reduce" column refers to the 1 x 1 convolution layer preceding the 3 x 3 convolution, so it refers to this one right here. The number of feature maps it produces as output is 96. These serve as input to the 3 x 3 convolution, which produces an output of 128 feature maps, which is concatenated there. Similarly, "#5x5 reduce" refers to the number of feature maps from the 1 x 1 convolution, which is 16, and the 5 x 5 itself produces 32 feature maps. The pool projection, that is, the max pooling followed by a 1 x 1 convolution, produces 32, and the size remains fixed in this case. So if you put these together, 64 + 128 + 32 + 32, you get 256 output feature maps. This is just for one inception block, so you can work through the table in the paper and see if the calculations are consistent with the structure that I showed you earlier.
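
As a quick check on the numbers read out for inception 3a, the sketch below adds up the depth of the concatenated output. The dictionary keys and the function name `output_depth` are labels invented for this example; the filter counts are the ones quoted above.

```python
# Filter counts for inception 3a as quoted above (input is 28 x 28 x 192).
inception_3a = {
    "1x1": 64,
    "3x3_reduce": 96,   # 1 x 1 convolutions before the 3 x 3
    "3x3": 128,
    "5x5_reduce": 16,   # 1 x 1 convolutions before the 5 x 5
    "5x5": 32,
    "pool_proj": 32,    # 1 x 1 convolutions after the max pooling
}


def output_depth(cfg):
    """Depth after concatenation: only the last layer of each branch contributes."""
    return cfg["1x1"] + cfg["3x3"] + cfg["5x5"] + cfg["pool_proj"]


print(output_depth(inception_3a))  # 256
```

The reduce layers do not appear in the sum because their outputs are consumed by the 3 x 3 and 5 x 5 convolutions rather than being concatenated.
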