Transcript:

Welcome to a presentation on GoogLeNet, the winner of the ImageNet Large Scale Visual Recognition Challenge in 2014. GoogLeNet introduces new techniques and insights for image classification, and that’s the subject of this presentation. GoogLeNet is the first network to come along in 2014 that’s geared towards efficiency, and by that we mean smaller and faster models. Prior to GoogLeNet, we’ve seen convolutional neural networks such as AlexNet and OxfordNet introduce successful tools such as pooling and dropout for regularization. The reasons we should care about GoogLeNet are that it uses 12 times fewer parameters than AlexNet and, at the same time, is significantly more accurate than AlexNet. Lower memory use and lower power use are paramount for mobile devices. GoogLeNet stays within the targeted budget of 1.5 billion multiply-adds at inference time. Overall computational cost is less than twice that of AlexNet, although the network is much deeper than AlexNet. At the heart of GoogLeNet is the inception module, and this presentation introduces the inception module, the microarchitecture on which the GoogLeNet macro architecture is built. The evolution in convolutional neural networks had its start in 1989 with a modest network with three hidden layers. It consisted of a layered network composed only of convolutional and subsampling operations, followed by fully connected layers and ultimately a classifier layer for classifying handwritten digits. The connections for the lower two layers in the network were convolutional, and the connections for the upper two layers were fully connected. Each hidden layer is a higher-level representation of its lower layer. It’s worth studying this simple architecture, since most modern networks have a similar architecture, albeit more recently we’ve seen newer layers such as pooling and dropout that help with overfitted models by reducing the number of parameters.
The input is a 16 by 16 image with a single grayscale channel. The output layer produces a code, which is one of the ten handwritten digits that were presented at the input. Let’s look at the first hidden layer H1 and its connections to the input. The hidden layer H1 has 12 feature maps, H1.1 through H1.12. The kernel size is 5 by 5 and the stride is 2. That means the input image is subsampled by half. Thus, each feature map is of size 8 by 8, giving us a total of 12 times 8 times 8, or 768, hidden units for H1. Let’s compute the total number of connections. This equals the number of hidden units in H1, which is 768, times the number of channels, in this case 1 grayscale channel, times the kernel size, which is 25, plus a bias term for each of the 768 hidden units. Let’s also compute the total number of parameters. Since this is a convolutional layer, we share the weights applied to each image patch. The total number of parameters is the channel size times the kernel size times the number of feature maps, in this case 12, plus the 768 bias terms for each of the hidden units. Now let’s look at the connections between the hidden layer H2 and the hidden layer H1. H2 also has 12 feature maps, H2.1 through H2.12, but the connections between H1 and H2 are heavily constrained. Each of the units in H2 combines local information coming from only 8 of the 12 different feature maps in H1. This was done for an important reason, that is, to break the symmetry in the network and thus improve learning. The convolutional kernel size is 5 by 5 and the stride is 2, and that gives us 12 times 4 times 4, or 192, hidden units. The total number of connections is calculated as follows: the 192 hidden units times the size of the kernel, 25, times the eight channels coming in from the H1 layer, plus the 192 bias terms.
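
The counting rules just described for H1 and H2 can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the talk; the function name and its arguments are made up for this sketch, and the parameter count follows the talk's convention of shared kernel weights plus one bias per hidden unit.

```python
def conv_layer_counts(maps, out_h, out_w, kernel_h, kernel_w, in_maps_used):
    """Count hidden units, connections, and free parameters as the talk does.

    Hypothetical helper for illustration; not taken from any library.
    """
    units = maps * out_h * out_w
    fan_in = kernel_h * kernel_w * in_maps_used
    # Every unit has fan_in weighted connections plus one bias connection.
    connections = units * fan_in + units
    # Kernel weights are shared across positions; biases are per unit.
    parameters = fan_in * maps + units
    return units, connections, parameters


# H1: 12 maps of 8 by 8, 5 by 5 kernel, 1 grayscale input channel.
h1 = conv_layer_counts(12, 8, 8, 5, 5, in_maps_used=1)
# H2: 12 maps of 4 by 4, 5 by 5 kernel, each unit sees 8 of the 12 H1 maps.
h2 = conv_layer_counts(12, 4, 4, 5, 5, in_maps_used=8)

print("H1 (units, connections, parameters):", h1)
print("H2 (units, connections, parameters):", h2)
```

Weight sharing is what keeps the parameter count so far below the connection count: the same 25 weights are reused at each of the 64 positions in a feature map.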
The total number of parameters is likewise constrained to the kernel size of 25 times the eight input channels times the 12 feature maps, plus 192 bias terms. Now we look at the connections between hidden layer H3 and hidden layer H2. The layer H3 has 30 hidden units and it is fully connected to H2. Finally, the output layer has 10 output units and it is fully connected to H3 as well. Fast forward to 2012, and it was a watershed year where the top-5 error in the ImageNet contest dropped by more than 9% because of the use of a convolutional neural network now referred to as AlexNet. Since the resurgence of CNNs in 2012, variants of deeper and deeper CNNs have won the ImageNet contest, and by contest we mean the annual ImageNet Large Scale Visual Recognition Challenge. The graph shows that deeper nets, with more parameters to learn, do better at classification, the extreme case being ResNet, the winner in 2015 with 152 layers, while GoogLeNet, the winner in 2014, used a relatively modest 22 layers. Later, we will look at an idea that GoogLeNet builds on called Network in Network, which is a micro network embedded within a macro network. Before we do that, let us review the architecture of a conventional neural network up to this point. A typical design pattern for convolutional neural networks, or CNNs, is as follows. We have a stack of convolutional layers with kernel sizes as high as 7 by 7 and 11 by 11 and strides such as 1 or 2. Interspersed between the convolutional layers are contrast normalization and max pooling layers. The penultimate layers towards the end of the network were typically fully connected layers, although that’s no longer the case with the introduction of global pooling layers. The last layer is typically a loss layer, which encapsulates a loss function, and that hasn’t changed. Dropout layers were introduced to reduce the number of connections and overcome problems of overfitting with this type of approach.
The trend has been to increase the number of layers to get higher accuracies. Here we highlight the computational and economic challenges of very deep neural nets. Mainly, there are two challenges. Number one, adding layers increases the number of parameters and makes the network prone to overfitting. As we are in the supervised learning domain, the larger networks need more data, and even with data augmentation techniques, the amount of data may not be sufficient. Of course, gathering and annotating more data will be expensive. Second, deeper networks lead to a computational explosion: a linear increase in the number of filters in chained convolutional layers results in a quadratic increase in compute. Also, if weights are close to zero, we’ve wasted compute resources. Fast-forwarding from 2012, which marked the birth of AlexNet, to 2014, the designers of GoogLeNet took aim at efficiency and practicality, rather than a sheer fixation with accuracy. Overall, a good thing. The resultant benefits of the new architecture were that the model size was 12 times smaller than AlexNet and the model was significantly more accurate than AlexNet. Lower memory use and power use are vital for mobile devices, as we’re well aware. Also, the designers had a compute budget of 1.5 billion multiply-add operations, and they met that goal as well. The computational cost was less than two times that of AlexNet, which is an 8-layer network, whereas GoogLeNet is a 22-layer network. The inspiration guiding the inception module is earlier research work that theoretically proves that optimal neural nets can be built if we cluster neurons according to the correlation statistics in the data set. These findings can be applied layer by layer; that is, we analyze the correlation statistics in the previous layer of activations and cluster neurons with highly correlated outputs for the next layer. These findings are immediately relevant to images, where there already exists high correlation between local pixels in a neighborhood patch.
So we can cover them by a small 1 by 1 convolution. Additionally, correlation between a smaller number of spatially spread-out clusters can be covered by 3 by 3 and 5 by 5 convolutions. These findings also suggest that we can apply all the convolutions on the patch, the 1 by 1, the 3 by 3, and the 5 by 5, instead of picking one of them. The theory also suggests that we could apply a max pooling layer in parallel, and that’s what is done, as we see later. This slide visually summarizes what we said earlier. One, in images, correlation tends to be local, so we can cluster the neurons simply by taking convolutions on local patches. Two, we can cover local clusters by small 1 by 1 convolutions, we can cover more spread-out clusters with 3 by 3 convolutions, and we can cover even more spread-out clusters with 5 by 5 convolutions. And then we want the effect of all of them, so we stack or concatenate them on the same patch. So conceptually, the inception module is simply the concatenation across three convolutional scales, a 1 by 1 convolution, a 3 by 3 convolution, and a 5 by 5 convolution, together with a 3 by 3 max pooling. One of the intuitive and practical benefits of using multiple convolution sizes on a single patch is that visual information is being processed at several scales and then aggregated for the next stage. This improves the discriminatory power of the network. But there is a big problem. The network, as conceived, increases computation several fold when we have a large number of feature maps. This slide shows the solution to the computational explosion, which is the idea of dimensionality reduction. 1 by 1 convolutions are used to reduce the dimensions before the expensive 5 by 5 convolutions and 3 by 3 convolutions are applied. Each of the blocks shown here also includes a rectified linear unit activation, or ReLU, making them dual purpose. Now that the compute budget is controlled, this allows the network to have more stages.
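
To make the saving from dimensionality reduction concrete, here is a rough multiply-add count for a 5 by 5 convolution with and without a 1 by 1 reduction in front of it. The grid size and channel counts below (28 by 28, 192 input channels, 16 reduced channels, 32 output channels) are illustrative assumptions, not figures quoted in the talk, and the helper name is made up for this sketch.

```python
def conv_madds(height, width, in_ch, out_ch, kernel):
    """Multiply-adds for a stride-1 convolution that preserves the grid size.

    Hypothetical helper for illustration; bias additions are ignored.
    """
    return height * width * in_ch * out_ch * kernel * kernel


# Assumed sizes for illustration only.
H, W = 28, 28
IN_CH, REDUCED_CH, OUT_CH = 192, 16, 32

# 5 by 5 convolution applied directly to all 192 input channels.
direct = conv_madds(H, W, IN_CH, OUT_CH, kernel=5)

# 1 by 1 reduction to 16 channels, then the 5 by 5 convolution on those.
reduced = (conv_madds(H, W, IN_CH, REDUCED_CH, kernel=1)
           + conv_madds(H, W, REDUCED_CH, OUT_CH, kernel=5))

print(f"direct 5x5:        {direct:,} multiply-adds")
print(f"1x1 then 5x5:      {reduced:,} multiply-adds")
print(f"reduction factor:  {direct / reduced:.1f}x")
```

Under these assumed sizes the reduced path costs roughly a tenth of the direct one, because the expensive 5 by 5 kernel now sees 16 channels instead of 192.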
We should note that the inception module gives us the knobs and levers for a controlled balancing of compute resources and speed, resulting in networks that can be three to ten times faster. Therein lies the value of the inception module. We now succinctly summarize the insights presented so far. The GoogLeNet analysis leads to the following architecture choices, which is insight number one: choosing filter sizes of 1 by 1, 3 by 3, and 5 by 5, and no higher; applying all three filters on the same patch of image, with no need to choose any single one of these; concatenating all filters as a single output vector for the next stage; and concatenating an additional max pooling path, since max pooling is essential to the success of CNNs. GoogLeNet insight number two: decrease dimensions whenever computation requirements increase, via a 1 by 1 dimension reduction layer. Use inexpensive 1 by 1 convolutions to compute reductions before the expensive 3 by 3 and 5 by 5 convolutions. The 1 by 1 convolutions include a ReLU activation, making them dual purpose. We’re now ready to talk about the prior groundbreaking research, the Network in Network concept, that was yet another source of inspiration for the GoogLeNet designers. Think of the inception module as an independent microarchitecture being replicated at each macro layer to create the GoogLeNet macro architecture. To create a deep network, we just need to stack inception modules side by side. Between these inception modules, occasionally insert max pooling layers with stride 2. The stride of 2 decimates by half the resolution of the input grid. Experimentally, the designers found that stacking inception layers benefits the results when used at higher layers; lower layers are kept in traditional convolutional fashion. Stacking allows you to tweak each module independently without an uncontrolled blow-up in computational complexity at later stages, and therein lies its benefit.
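
The stacking described above can be followed at the level of tensor shapes alone. The sketch below tracks only shapes, not actual convolutions: each branch of a module keeps the grid size, the branch outputs are concatenated along the channel axis, and an occasional stride-2 max pooling halves the grid. The branch filter counts and the 28 by 28 grid are illustrative assumptions, and the function names are made up for this sketch.

```python
def inception_output_shape(height, width, branch_channels):
    """Shape after concatenating all branch outputs along the channel axis.

    Assumes every branch preserves the height and width of its input.
    """
    return height, width, sum(branch_channels.values())


def max_pool_stride2(height, width):
    """Grid size after a stride-2 max pooling (rounding up for odd sizes)."""
    return (height + 1) // 2, (width + 1) // 2


# Two stacked modules with assumed per-branch filter counts.
first = inception_output_shape(
    28, 28, {"1x1": 64, "3x3": 128, "5x5": 32, "pool_path": 32})
second = inception_output_shape(
    first[0], first[1], {"1x1": 128, "3x3": 192, "5x5": 96, "pool_path": 64})

# A stride-2 max pooling between groups of modules halves the resolution.
pooled = max_pool_stride2(second[0], second[1])

print("after module 1:", first)
print("after module 2:", second)
print("grid after stride-2 pooling:", pooled)
```

Because each module's output depth is just the sum of its branch widths, a single module can be widened or narrowed without touching the others, which is the independence the talk refers to.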
This visual shows the stacking of the inception modules within the GoogLeNet macro architecture. We introduce an average pooling layer and a new linear embedding layer, which I’ll explain in the coming slides. Because the network is rather deep, the designers of GoogLeNet were concerned about the gradients vanishing in the intermediate layers during backpropagation of errors. This could possibly happen due to dead ReLUs, in which case the weight updates cease to occur and the hidden units stop learning. To combat this phenomenon, the designers used the intuition that the intermediate layers do have discriminatory power, since they already have higher-level representations of the image. So why not tap into these intermediate representations and use them for the backpropagation of errors? To do this, they append auxiliary classifiers to the intermediate layers. During training, the intermediate losses were added to the total loss with a discount factor of 0.3. The details of the network are shown in the next slide. The last insight that GoogLeNet provides is the idea of replacing the fully connected layer at the end of the network with a global average pooling layer. Fully connected layers are prone to overfitting, which hampers generalization. Luckily, global average pooling has no parameters to be learned and thus nothing in that layer to overfit. The overall network benefits from fewer parameters as well. We see that same principle used in yet another network called SqueezeNet, which also utilizes the concept of Network in Network. Note that global average pooling does not exclude the use of dropout, a proven regularization method to avoid overfitting. Note also that GoogLeNet provides a linear layer at the end of the network for adapting the network to other labels. This is helpful when you are working on your own problem that needs to classify just a few classes.
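
Two of the ideas above are simple enough to sketch in plain Python: adding the auxiliary classifier losses to the total loss with the 0.3 discount factor, and global average pooling, which collapses each feature map to a single number without any learned weights. The function names, the example loss values, and the tiny 2 by 2 feature maps are assumptions made up for this illustration.

```python
def total_training_loss(main_loss, auxiliary_losses, discount=0.3):
    """Main loss plus the discounted sum of the intermediate (auxiliary) losses."""
    return main_loss + discount * sum(auxiliary_losses)


def global_average_pool(feature_maps):
    """Replace each 2-D feature map by its mean value; no parameters involved."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


# Hypothetical loss values from the final and two intermediate classifiers.
loss = total_training_loss(main_loss=1.0, auxiliary_losses=[2.0, 3.0])

# Two made-up 2 by 2 feature maps, pooled down to one value per map.
maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
pooled = global_average_pool(maps)

print("total training loss:", loss)
print("pooled vector:", pooled)
```

The auxiliary terms matter only during training; at inference time only the final classifier is used, so the discounted sum would not be computed.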
This concludes all the insights provided by GoogLeNet, and now we can summarize the five insights that we’ve discussed so far. Number one, exploit fully the fact that in images, correlation tends to be local: concatenate 1 by 1, 3 by 3, and 5 by 5 convolutions, along with max pooling, for every image patch. Number two, decrease dimensions whenever computation requirements increase, via the 1 by 1 dimension reduction layer. Number three, stack inception modules upon each other, following the principles of Network in Network. Number four, counterbalance backpropagation downsides in deep networks: use intermediate loss layers in the final loss. Number five, end with a global average pooling layer instead of a fully connected layer. And that concludes this presentation.