Transcript:
[MUSIC] This video will explain the ResNet, a deep convolutional neural network architecture. This is one of the most popular neural network designs ever published, with over 20,000 citations.

Deep learning is thought of as learning a hierarchical set of representations, so that the network learns low-, mid-, and high-level features in images. This is analogous to learning edges, then shapes, then objects, so theoretically, more layers should enrich the levels of the features. Models prior to ResNet typically had depths of 16 to 30 layers. So the idea is: shouldn't building better neural networks be as easy as adding more layers to the network? The first contribution of the ResNet paper is showing that if you just continue to stack convolutional layers with activations and batch normalization, training eventually gets worse, not better. But they offer this construction insight: if you consider a shallow architecture and its deeper counterpart with more layers, theoretically all the deeper model would need to do is copy the output of the shallower model with identity mappings. So the construction suggests that a deeper model should produce no higher training error than its shallow counterpart. However, the identity function isn't an easy function to learn, so the residual formulation gives the layers a reference to their input through these identity or skip connections, such that if the network needed to push a layer's residual down to zero, it could easily do so in this framework. This shows the residual connection, which is the building block for the residual network, or ResNet.
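The residual building block just described can be sketched in a few lines of plain Python. This is only an illustrative toy, not the paper's implementation: `relu` stands in for the block's full conv/BN/ReLU stack, and plain lists stand in for feature maps.

```python
# Toy sketch of a residual connection: y = F(x) + x.
# `F` stands in for the block's stack of conv/BN/ReLU layers.

def relu(v):
    return [max(0.0, a) for a in v]

def residual_block(x, F):
    """Add the block's residual output F(x) back onto its input x."""
    fx = F(x)
    return [f + xi for f, xi in zip(fx, x)]

x = [1.0, -2.0, 3.0]

# A block whose residual is driven to zero reduces to the identity
# mapping -- the "easy" fallback a deeper model can learn.
zero_residual = lambda v: [0.0] * len(v)
assert residual_block(x, zero_residual) == x

# A non-zero residual adds a learned correction to the input.
assert residual_block(x, relu) == [2.0, -2.0, 6.0]
```

The point of the sketch: learning "output zero" is much easier for a stack of layers than learning "output exactly your input", and the skip connection turns the first into the second for free.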
One interesting thing with ResNets is what happens when the previous layer's dimensions don't match the input to the next layer. A 3×3 convolution with no padding, for example, would shrink the spatial dimensions of an image from, say, 32×32 down to 30×30. So what they propose here are different schemes for matching the previous layer's output to the next layer through the identity skip connection. There are two ways to do this: either zero-pad the missing dimensions, which adds no extra parameters and is really quick, or expand the dimensions with a 1×1 convolution.

This image shows what the ResNet looks like, in contrast to a 34-layer plain network, which is a series of convolutional layers followed by batch normalization and activations, and compared to another really popular model, VGG-19.

In the ResNet experiments, they test a 152-layer network on ImageNet, and this gets them state-of-the-art results. It is eight times deeper than VGG nets, but in terms of floating-point operations it actually requires less computation than VGG-19, shown by the billion-FLOPs metric. With an ensemble of ResNets they achieve 3.57% error on the ImageNet test set, which earns them the state of the art. They also test on CIFAR-10 with a hundred and even a thousand layers, and they use ResNet features for COCO object detection. The way object detection networks work is that you use something like VGG or ResNet to extract features from the image, and then you classify the different bounding boxes produced by some region proposal algorithm.

Some more details about the ResNet: it uses batch normalization after each convolution and before activations, and it uses the He initializer, invented by the paper's first author, Kaiming He.
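The dimension bookkeeping above can be checked with the standard convolution output-size formula; the helper names here are made up for illustration, and channels are modeled as plain list entries.

```python
def conv_out_size(n, k, stride=1, pad=0):
    """Standard conv output-size formula: floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1

# An unpadded 3x3 convolution shrinks a 32x32 feature map to 30x30,
# so the skip connection's input no longer matches the block's output.
assert conv_out_size(32, 3) == 30

# With pad=1, a 3x3 conv preserves spatial size; ResNet downsamples
# with stride-2 convolutions instead.
assert conv_out_size(32, 3, stride=2, pad=1) == 16

# The parameter-free matching scheme for a channel mismatch:
# zero-pad the extra channels (no new weights to learn).
def zero_pad_channels(x, out_channels):
    return x + [0.0] * (out_channels - len(x))

assert zero_pad_channels([1.0, 2.0], 4) == [1.0, 2.0, 0.0, 0.0]
```

The alternative, a 1×1 convolution, would instead learn a projection matrix mapping the old channels to the new ones, which costs extra parameters.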
They use a batch size of 256, a stepped learning rate schedule, and weight decay, and, interestingly, they don't use dropout. They also test one interesting test-time augmentation, where they don't just predict on the test image: they take ten crops from the test image, the model predicts on each of the crops, and the predictions are averaged to form the final prediction.

The first experimental result shows how the ResNet continues to get better as you go from 18 to 34 layers, dropping from roughly 27.8% to 25% error, whereas the naive stacking of convolutional layers is already starting to get worse: the plain network's error goes up almost half a percentage point from the increase in layers.

Then they test the options for when you skip ahead and the dimensions don't match: do you zero-pad, do you use these 1×1 convolutions, and how frequently do you use the projections? They find that the 1×1 convolutions, also known as projections, do give a slight performance boost, but at the cost of a significant number of extra parameters. One other thing they do when they train ResNet-50, -101, and -152 is extend the skip connection so it skips ahead three layers rather than two, as in the normal residual building block; this bottleneck design is done just to save training time.

These are the results for the different depths of ResNet, with B and C denoting the different ways of doing the projection matching, compared to some of the other state-of-the-art models like Inception and VGG. And these are the results of the ensemble of ResNets achieving the state of the art on top-5 predictions on the ImageNet test set, along with the results on the CIFAR-10 data set.
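The ten-crop evaluation boils down to simple prediction averaging. In the paper the ten crops are the four corners and the center of the image plus their horizontal flips; here `model` and the crops are hypothetical stand-ins just to show the averaging step.

```python
def average_predictions(model, crops):
    """Average the model's class scores over all crops of one image."""
    preds = [model(c) for c in crops]
    n = len(preds)
    return [sum(p[i] for p in preds) / n for i in range(len(preds[0]))]

# Toy stand-in: each "crop" is already a score vector and the "model"
# just returns it, so the averaged result is easy to verify by hand.
identity_model = lambda crop: crop
crops = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
assert average_predictions(identity_model, crops) == [0.625, 0.625]
```

Averaging over crops smooths out sensitivity to where the object sits in the frame, at the cost of ten forward passes per test image.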
And interestingly, you see here that when they go up to 110 layers they achieve the state of the art, but when they try to go to 1202 layers, the error goes back up, so they haven't quite figured out how to make it go that deep yet. Then this shows how features extracted from ResNet outperform VGG features on the localization, or bounding-box detection, task. So again, they do figure out how to make the network significantly deeper than, say, the 19 layers of VGG, but even with this mechanism the 1202-layer net still does not perform well, and they suggest in the paper that this is due to overfitting. Thanks for watching this video on ResNets. Please subscribe to Henry AI Labs; the paper link is in the description. [MUSIC]