ResNeXt | L13/7 Beyond ResNet

Alex Smola


Well, an obvious thing you can ask yourself is what comes after ResNet, and that's ResNeXt. I think we covered that last time, but it was a little bit brief, so I'm going to review it in a bit more detail.

Remember, in ResNet you essentially have this stack of 1×1 and 3×3 convolutions. Effectively, what happens is that if I have one layer here and another layer there, then within a single pixel you have this entire stack of channels, and you want to turn that into another stack of channels. If you do that, this requires basically a $c_i \times c_o$ operation, which is expensive, and the dimensionality that you can afford is tied to $c_o$, and obviously also to $c_i$. So you have two knobs that you might want to control separately: the number of parameters, which depends on $c_i \times c_o$, and the number of output dimensions, which depends on $c_o$. ResNeXt is a very clever way of separating the dependency between those two.

What you do is, effectively, if this is my $c_i \times c_o$ matrix, then rather than taking the full matrix you approximate it by a block-diagonal matrix, where within each block things are dense but the rest is all zeros. If you do that, you can afford a larger matrix: you can make it larger by having, say, another two blocks. Alternatively, the number of parameters is reduced. So, by breaking up the convolutions into individual sub-channels, you get something more memory efficient. If you then go through the detailed calculation, that's what they did in the ResNeXt paper, and they use a design that is very, very similar to ResNet.
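The block-diagonal idea above can be sketched in a few lines of NumPy. This is only an illustration under made-up assumptions: the helper names `block_diagonal_weight` and `grouped_params` and all the sizes are hypothetical and not taken from the ResNeXt paper. It shows that multiplying by the block-diagonal matrix is the same as running each channel group independently, and that with 4 groups you can double the width at the same parameter count.

```python
import numpy as np

def block_diagonal_weight(c_in, c_out, groups, rng):
    """Build a (c_out, c_in) matrix that is dense inside `groups` diagonal blocks, zero elsewhere."""
    assert c_in % groups == 0 and c_out % groups == 0
    bi, bo = c_in // groups, c_out // groups
    weight = np.zeros((c_out, c_in))
    for g in range(groups):
        weight[g * bo:(g + 1) * bo, g * bi:(g + 1) * bi] = rng.standard_normal((bo, bi))
    return weight

def grouped_params(c_in, c_out, groups):
    """Number of nonzero weights in the block-diagonal channel-mixing matrix."""
    return c_in * c_out // groups

rng = np.random.default_rng(0)
c_in, c_out, groups = 8, 8, 4
weight = block_diagonal_weight(c_in, c_out, groups, rng)
x = rng.standard_normal(c_in)  # the stack of channels at a single pixel

# Multiplying by the full block-diagonal matrix ...
y_full = weight @ x

# ... gives the same result as running every group on its own slice of channels.
bi, bo = c_in // groups, c_out // groups
y_grouped = np.concatenate([
    weight[g * bo:(g + 1) * bo, g * bi:(g + 1) * bi] @ x[g * bi:(g + 1) * bi]
    for g in range(groups)
])

# Same parameter budget, twice the channels once the matrix is split into 4 blocks.
dense_count = grouped_params(64, 64, groups=1)
wider_count = grouped_params(128, 128, groups=4)
print(np.allclose(y_full, y_grouped), dense_count, wider_count)
```

In a real network you would not store the zeros at all; frameworks typically expose this as a grouped convolution, which is what makes the predefined sparsity cheap.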
Which isn't such a big surprise, because Kaiming He is one of the authors on both papers. They basically engineered ResNeXt to have an essentially identical number of parameters and number of FLOPs: it just has more channels, but at the same time no larger number of parameters or FLOPs.

Yes? Okay, so sparse convolution matrices. Yeah, effectively this is a special sparse matrix, if you think about it. The question was: Alex just wrote down a really weird block-diagonal sparse matrix, so why can't we have a general sparse matrix there? As a matter of fact, you could do that. The problem is that GPUs are really awful when it comes to sparse matrices. Part of the reason is that GPUs don't like pointer lookups, so if the matrix is not too sparse, you're better off actually just using zeros for the rest. You might have only 10% nonzeros and still be better off writing it out as a dense matrix. Where that trade-off occurs, whether it's at 1% or 10% or 30%, really depends on the GPU or the CPU on which you're executing things. In other words, you're not going to get any speedup by going sparse unless you're extremely sparse. The other thing is that it's a lot harder to control the result. So you're much, much better off having a predefined sparsity for which you can then optimize on your GPU.

Now, there's no reason why you couldn't have another few blocks in here, maybe something like that, block sparse. It would actually be an interesting research project to add another set of block-sparse connections here. As a matter of fact, I'm going to show you a strategy that's very similar to that in the next few slides, so that's ShuffleNet. But it's a good question.

Yes, exactly, so if I know where my sparsity is.
And if the sparsity is predictable in such a way that I don't need a secondary index structure to store the sparse index, then I can do this efficiently on my GPU, whereas if I always encounter some surprise when doing that sparse lookup, then my GPU is going to be slow.

So let's look at a few more ideas, and this is really just going to give you the tip of the iceberg. One idea is: well, after all, ResNet worked, so why not go even further? If you think about it, ResNet went from parameterizing around $f(x) = 0$ to parameterizing around $f(x) = x$ as the simple function. But you can essentially get something like a higher-order Taylor-series-type expansion. What you do is define $x_{i+1}$ to be the concatenation of $x_i$ and $f_i(x_i)$, that is, $x_{i+1} = [x_i, f_i(x_i)]$. As a result, that vector will keep on growing, and it will have increasingly high-order terms in the expansion. You start with $x_1 = x$; then $x_2 = [x, f_1(x)]$; then $x_3$ (there's a typo here) is $[x, f_1(x), f_2([x, f_1(x)])]$. And I got bored after that, because the expansions just get a lot longer: $x_4$ would be this term plus $f_3$ of all of that, and that just gets tedious.

This is what DenseNet did, and at the time it looked like this was the right way to go. Actually, quite surprisingly, if you train ResNet really well, or ResNeXt really well, it outperforms DenseNet. So how did they manage to do really well? (And I think they got a best paper prize for DenseNet.) Well, their training implementation was better. So sometimes it's not the network but how you train it that lets you win benchmarks. This is something that isn't always clear to people when they read papers. And how would you know? Sometimes the description of training is quite vague, and sometimes the only way to get to the bottom of it is to actually look at the code. Okay, so DenseNet is kind of useful, but not that much.
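The concatenation rule $x_{i+1} = [x_i, f_i(x_i)]$ is easy to see in code. The following is a minimal sketch, assuming each $f_i$ is just a small random linear map followed by a ReLU acting on one channel vector; the names `dense_block` and `make_layer` and the sizes are hypothetical, and real DenseNet layers are convolutions with batch norm rather than this toy function.

```python
import numpy as np

def make_layer(c_in, growth, rng):
    """Toy stand-in for f_i: a linear map to `growth` new channels followed by ReLU."""
    w = rng.standard_normal((growth, c_in)) / np.sqrt(c_in)
    return lambda v: np.maximum(w @ v, 0.0)

def dense_block(x, layers):
    """Apply x_{i+1} = concat(x_i, f_i(x_i)) for every layer and record the widths."""
    widths = [x.shape[0]]
    for f in layers:
        x = np.concatenate([x, f(x)])  # keep everything so far, append the new features
        widths.append(x.shape[0])
    return x, widths

rng = np.random.default_rng(0)
c0, growth, depth = 16, 4, 3

# Layer i sees all earlier outputs, so its input width is c0 + i * growth.
layers = [make_layer(c0 + i * growth, growth, rng) for i in range(depth)]

x0 = rng.standard_normal(c0)
out, widths = dense_block(x0, layers)
print(widths)
```

The input survives unchanged in the first `c0` entries of the output, and the width grows linearly with depth, which is why the expansion gets long so quickly.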
Here's one that's actually a little bit more exciting. It's called Squeeze-and-Excitation Net, or SE-Net, and it uses something that we will cover in more detail later on, namely attention. Attention is essentially a mechanism where, rather than taking averages over a bunch of vectors, we use a separate function to gate how that average should be computed. I'll just leave it at that for now; we'll get into it in a lot more detail later on when we cover attention.

What Squeeze-and-Excitation does is this. If you think about the various channels, maybe there's a cat channel, and there's a dog channel, and maybe there's a, I don't know, dinosaur channel. Now, if you knew that you're recognizing a cat, what would you do? You would upweight that cat channel and downweight the rest. But that's kind of stupid, because how would I know that I'm recognizing a cat until I've actually recognized the cat? Once I know the answer, the question becomes a lot easier.

The other thing is that the information transfer in convolutional networks is kind of slow. If you think about it, we have maybe a 3×3 convolution, another 3×3, then we pool, and so on, so it can take four, five, six layers until the information from this corner percolates to that corner. And that's awful, because maybe if there is a bowl of milk here, I know that the chances that there's a cat over there are much higher. I would know that from the context, so my cat detector can use the fact that there's a bowl of milk to infer that there's a cat.

So what are we supposed to do? What you could do is take very simple inner products of the entire image, on a per-channel basis, with some other vector. Now you get some numbers: you get channel-many numbers out of it.
This is a very simple object. It's fairly cheap to do compared to all the convolutions and everything. And now you use those numbers, with a softmax over them, to reweight your channels. So if this very cheap procedure tells me there's a good chance that there's a cat somewhere, I can now upweight the cat channel. Okay, I bet there's no cat channel, but if there was one, it would do that. Suffice it to say, SE-Nets actually improve the accuracy, so they're currently actually the best ones among the models.

Yes? No, I have one weighting function per layer. This basically gets computed in parallel to the convolution, and then the results from both paths are merged. You're basically performing a pixel-wise multiplication of that weighting vector, the one drawn in nice, pretty colors, with the original tensor. The weighting is global over all the pixels in a channel, so it's global in that sense, and that allows me to send information about what's going on in the world very quickly to other parts of the image.

Okay, so the weighting function. Let me write that out. Let's say I have $x \in \mathbb{R}^{C \times H \times W}$; I'm going to drop the batch dimension right now. I'm going to multiply this by some weighting tensor $w$, which is also in $\mathbb{R}^{C \times H \times W}$, and I get the following result: $y_c = \sum_{h,w} x_{c,h,w} \, w_{c,h,w}$. So $y$ is, of course, in $\mathbb{R}^{C}$. Then I perform the update $y \leftarrow \mathrm{softmax}(y)$, and in the end I can use that to reweight every element of $x$: $x_{c,h,w} \leftarrow y_c \, x_{c,h,w}$.

It would be, but it's not quite a convolution, because you actually get one result per channel. Not quite, because you have one output per channel.
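Here is a minimal NumPy sketch of the reweighting as it is written out above: one inner product per channel, a softmax over the resulting numbers, then a per-channel rescaling. The function name `channel_reweight` and the tensor sizes are hypothetical. Note the assumption: this follows the simplified description in the lecture. The published Squeeze-and-Excitation block instead uses global average pooling, a small two-layer bottleneck, and a sigmoid gate.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def channel_reweight(x, w):
    """x, w: arrays of shape (C, H, W). Returns the reweighted tensor and the gate."""
    # y_c = sum over h, w of x[c, h, w] * w[c, h, w]: one number per channel
    y = (x * w).sum(axis=(1, 2))
    gate = softmax(y)
    # Broadcast the per-channel gate over all pixels of that channel.
    return gate[:, None, None] * x, gate

rng = np.random.default_rng(0)
C, H, W = 3, 4, 4
x = rng.standard_normal((C, H, W))
w = 0.1 * rng.standard_normal((C, H, W))  # would be learned in a real network

out, gate = channel_reweight(x, w)
print(gate.shape, out.shape)
```

Because the gate has one entry per channel and is computed from the whole image, a single cheap reduction lets information from any pixel influence every other pixel of that channel.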
If it were a convolution, then you would only have one $y$, which would entirely defeat the purpose, because then you'd have a single number by which you reweight the entire activations, so again you would not have expressed any preference between any of the channels. So it doesn't quite fit into a convolution; it fits much more closely into just an inner product. It's really a tensor reduction.

Yes, yes, so the $w$'s are all learned, and it's a fairly small number of parameters in addition to everything else. It makes training a bit more expensive, but the cost is overall reasonably benign, and then you get high accuracy.

Okay, now here's the last thing, and this goes in the direction of: can we do something a little bit more structured, or more sparsely structured, with our networks? If you think about ResNeXt, ResNeXt breaks up the channels into subgroups, and then within each subgroup of channels you do your stuff, and then you combine. Now, that's not necessarily very good, because you may end up getting those very long stovepipes, essentially, where the features only mix within each of those but not across them. After all, you got rid of the cross-channel mixing in order to get faster computation and everything. One way to bring it back is to just reshuffle things in between convolutions. In this case, if we have three groups, I basically pick one from each of the reds, greens and blues and turn that into a new block, and so I essentially intertwine things in a meaningful way. That gives you a little bit more accuracy, and ShuffleNet is what you get out of that. They applied the shuffle operation to ResNet and to ResNeXt and to SE-Net, and indeed it helps.

Yes? Yeah, basically, what you would have gotten before in
ResNeXt, if we look at that: you would basically have had those four networks operating in parallel. What they do is, between every convolution, they mix up the features between the various networks. So they essentially add a permutation matrix in between, and since it only has unit weights, there's nothing to learn.

Yes? It doesn't learn how to shuffle, no. That's an interesting question; maybe somebody can figure out a way to do that. My hunch would be that going from permutations to something where we have maybe two copies or three copies, overall something like log-number-of-channels copies, might potentially help, but I don't know whether it would really make any difference relative to other architectures. Given that the number of channels isn't that large, I mean, it might be 32, there isn't that much that you can gain.

Yes? Why is it very efficient? Well, MobileNet is one of those cases where you can get high accuracy for a comparatively small number of computations, and ResNeXt gives you even higher accuracy for that. So now you have this trade-off of accuracy versus speed. You can either try to win the benchmark by having a network that's humongous and highly accurate (and mind you, there's essentially a ShuffleNet paper that aims for high accuracy; the title is a little bit different, but it's basically the same authors and a very similar network architecture, but high accuracy), or you can go along this Pareto curve of accuracy versus speed and push for speed. ShuffleNet tends to be a little bit faster than, for instance, MobileNet.

So there are a number of other tricks that you can do, but that's pretty much the bag of tricks that kind of work in the context of computer vision. Probably by this time next year there will be three or four more slides of things that work. And yeah, so there is one last thing.
Separable convolutions, if you will. That's in MobileNet, and it's actually a precursor of ResNeXt. ResNeXt has groups of channels; separable convolutions basically treat each channel separately. So if I have 20 channels, I get 20 separate convolutions, whereas in ResNeXt maybe I break up those 20 channels into five groups of four each. In ShuffleNet I would do the latter and then shuffle between them. Okay, this is about it for what covers the more interesting parts of the model zoo.

So, to summarize a little bit: we talked about Inception and ResNets. The key point in Inception was essentially that you can mix and match different types of convolutions and you can use batch norm. ResNet uses this idea of a Taylor expansion. ResNeXt decomposes convolutions, so it's basically separable convolutions but with a bit more control. And then there's this entire zoo of additional things that you can do, and SE-Net and ShuffleNet are probably the more interesting parts there. That's it for the model zoo. Now, any questions so far on the theory? Okay, good.
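To make the 20-channel example concrete, here is a small sketch that counts the weights of a 3×3 convolution in the three regimes just mentioned (fully dense, five groups of four, one group per channel) and then applies a channel shuffle between groups. The helper names `conv_params` and `channel_shuffle` are hypothetical, the 3×3 kernel size is an assumption, and the count covers only the per-channel part of a separable convolution, not the 1×1 convolution that MobileNet places after it. The reshape, transpose, flatten recipe is one common way to write the shuffle as a fixed permutation.

```python
import numpy as np

def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a kernel x kernel convolution split into `groups` channel groups (no bias)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel * kernel

def channel_shuffle(x, groups):
    """x: (C, H, W) with each group's channels stored contiguously. Interleaves the groups."""
    c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(groups, c // groups, h, w)  # split channels into (group, index within group)
    x = x.transpose(1, 0, 2, 3)               # swap the two axes
    return x.reshape(c, h, w)                 # flatten back: a fixed permutation, nothing to learn

channels, kernel = 20, 3
dense = conv_params(channels, channels, kernel)                       # every channel sees every channel
grouped = conv_params(channels, channels, kernel, groups=5)           # five groups of four
depthwise = conv_params(channels, channels, kernel, groups=channels)  # each channel on its own

# Label each channel with its index so the permutation is visible.
x = np.arange(channels).reshape(channels, 1, 1)
order = channel_shuffle(x, groups=5)[:, 0, 0].tolist()
print(dense, grouped, depthwise)
print(order)
```

After the shuffle, the first five channels come from five different original groups, so the next grouped convolution mixes features that previously sat in separate stovepipes.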
