Transcript:
[MUSIC] This video will explain wide residual networks, otherwise known as wide ResNets and frequently abbreviated in papers as WRN-(number of layers)-(widening factor). The headline from this paper is that a simple 16-layer wide ResNet outperforms all previous ResNet models, which the paper refers to as thin ResNets, and we'll explain what that means. So the wide ResNet-22-8, 22 being the number of layers and 8 being the widening factor, outperforms the ResNet with over a thousand layers by 1% on CIFAR-10 and 3.5% on CIFAR-100. They don't mention the exact training speed-up for WRN-22-8, but it's probably similar to WRN-40-4, because the widening factor has a computational cost similar to the extra 18 layers. They also add the note that the wide ResNet-40-4, with 40 layers and widening factor 4 (again, we're going to explain what the widening factor means in this presentation), is 8 times faster to train than the ResNet with a thousand layers.

So these are some of the limitations of ResNets. ResNets introduced the skip connection, where you take the features from layer l-1 and just add them ahead to the layer l features. This really improved convolutional networks and enabled training of really deep networks. AlexNet has something like 8 layers, VGG ranges from about 13 to 19 layers, and with this classic model of just stacking convolutional blocks, if you try to put together 30 or 50 layers, it starts to perform terribly, but with the ResNet it actually continues to improve. What they find, though, is that there are diminishing returns: if you try to scale the ResNet from, say, 150 layers to a thousand layers, you'll be roughly doubling your training time for very small performance gains.

So what they're going to do in this paper is take apart the ResNet blocks and widen them. What I mean by widening is that they're going to increase the number of features, so the channel dimension is going to increase. Compared to the model right here, the basic-wide block is wide because it has more feature maps. They're also going to compare that with the bottleneck layer in the ResNet, and they're going to introduce dropout in between the convolutions and show how this improves performance as well. So yeah, they show a lot of good results with this dropout between the convolutions. They had previously tried to put dropout on the identity mapping: remember, the identity mapping just copies the previous layer's activations and sends them ahead to the next layer, so they originally thought that putting some dropout there might be useful, but that doesn't get a good result. They find that putting it between the convolutions is what works.

So in the classic ResNet there are two kinds of blocks. They use the basic block, which is the 3x3 block, where each layer is a convolution, batch normalization, and then a ReLU activation. And then they also have the bottleneck, which they use to shrink the features, sort of an interesting heuristic; I don't think it's well understood. It uses the 1x1 convolution, which preserves the spatial dimensions, so you could put in a 30x30 feature map, run it through a 1x1 convolution, and it comes out 30x30 as well.
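To make that block structure concrete, here is a minimal sketch of the classic basic block just described, assuming PyTorch; the class name BasicBlock and the channel sizes are illustrative, not the original ResNet code. It shows the two 3x3 convolutions, each with batch normalization and ReLU, and the identity shortcut being added back in rather than concatenated.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Classic ResNet basic block: 3x3 conv -> BN -> ReLU, twice, plus identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # skip connection: features from layer l-1
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)          # added to the layer l features, not concatenated

# A 30x30 feature map keeps its spatial size through the block,
# just like a 1x1 convolution in the bottleneck variant preserves it.
x = torch.randn(1, 64, 30, 30)
print(BasicBlock(64)(x).shape)                    # torch.Size([1, 64, 30, 30])
```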
So the wide ResNet just has a slight modification to this, where it changes the order: instead of convolution, batch normalization, ReLU, they do batch normalization, ReLU, convolution. I wasn't really 100% sure why they did this, but they find a good result with it. The idea is to increase the representational power of the residual blocks, and they do this by adding more convolutional layers per block, widening the layers by adding more feature maps (they call them planes), and increasing the filter size, meaning the kernel, like 3x3 or 5x5. So they're going to play with these two parameters, l and k, where l is the number of convolutions in a block. Previously we saw l = 2, where you have a 3x3 and then another 3x3 convolution, but maybe having four or five is useful, so they test that parameter. And then k is the widening factor, which scales the number of feature maps in the convolutional layers.

The number of parameters is going to increase linearly with l, but quadratically with k. However, even though it increases quadratically with k, because you're adding more feature maps, this is perfect for the GPU: you're distributing the same tensor from the previous activation across the different feature maps. So the widening factor, even though it's a quadratic memory increase, isn't as bad on the computational side.

This is the notation they use to describe the wide ResNets: WRN-n-k. You'll probably see this in other papers as well, when they propose some new technique and give you a table of the different convolutional architectures they compare their technique across. The n refers to the number of layers, which in this case is 40, and then 2 is the widening factor. That's relative to the original ResNet, so 2 just means double the feature maps of the original ResNet model. This is the overall structure of the wide ResNet, with these different convolutions and convolutional blocks, where each block has this kind of structure. Again, the classic ResNet is thin, the feature maps aren't that wide, compared to this one where they're really wide, with a lot of feature maps.

These are the first tests, with the different types of convolutional structures: first they try 3x3 then 3x3, then 1x1 then 3x3, and all these different variants, but when they test this they don't really find a significant difference; the performance is basically the same for each of these variants, so they just proceed with the 3x3 convolutions. Then this is the l parameter, saying how many 3x3 convolutions the intermediate features should go through, and they test with 1, which performs worse, while 2 and 3 are about the same, so they overall just decide to stick with 2. Here is a more interesting table, where they have different depths and different widening factors for the wide ResNet, and they get the best results with a depth of 28 and a widening factor of 10. So yeah, that's a pretty interesting result. Then here is where they compare the wide ResNet with the original ResNet and with the ResNet that uses the modified batch normalization, ReLU, convolution block structure rather than the other way around.
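And here is a minimal sketch of the wide block variant just described, again assuming PyTorch with hypothetical names (WideBlock, base_width, and the dropout rate are my choices, not the paper's reference code): the pre-activation order of batch normalization, ReLU, then convolution; the channel count scaled by the widening factor k; and dropout placed between the two convolutions rather than on the identity mapping. The loop at the end illustrates the roughly quadratic parameter growth with k.

```python
import torch
import torch.nn as nn

class WideBlock(nn.Module):
    """Wide residual block: BN -> ReLU -> conv, twice, with dropout between the convolutions."""
    def __init__(self, in_channels, base_width, k=2, dropout=0.3):
        super().__init__()
        width = base_width * k                    # widening factor k scales the feature maps
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, width, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(width)
        self.drop = nn.Dropout(dropout)           # dropout between convolutions, not on the shortcut
        self.conv2 = nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection on the shortcut when the channel count changes
        self.shortcut = (nn.Conv2d(in_channels, width, kernel_size=1, bias=False)
                         if in_channels != width else nn.Identity())

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.drop(self.relu(self.bn2(out))))
        return out + self.shortcut(x)

# Parameter count grows roughly quadratically with k:
# the second 3x3 conv alone has (16*k) * (16*k) * 9 weights.
for k in (1, 2, 4):
    block = WideBlock(in_channels=16, base_width=16, k=k)
    print(f"k={k}: {sum(p.numel() for p in block.parameters())} parameters")
```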
And so you see how they outperform all the previous methods. Even though they use more parameters, it takes less time to train, because the widening factor is easier to parallelize across the GPU, so it trains faster, although it is more memory heavy. Then they show the effectiveness of using dropout, and you can see that, compared to the no-dropout counterparts, it's a slight improvement in most cases. Well, not really in all cases; it's pretty minor, definitely more pronounced on CIFAR-100. I mean, they show it works, but I don't think it's such a significant improvement. So thanks for watching this video on wide residual networks. Please subscribe to Henry AI Labs for more deep learning paper summaries and videos.