
Pytorch Common Mistakes – How To Save Time 🕒

Aladdin Persson



Transcript:

In this video, I will show you how to save a lot of time when training neural nets; these tips are even Karpathy approved. So you want to become a neural net god? Then watch this video.

Number one: didn't overfit a single batch first. Let's say you're done setting up your network, the training loop, the hyperparameters, and so on. It's tempting to just start training, but don't. Instead, where you've set up your training loader, take out a single batch: data, targets = next(iter(train_loader)). Now we have a single batch, so we comment out the loop that iterates over the training loader, dedent everything, and run this one batch for a number of epochs. Right now num_epochs is three and the batch size is 64, though it can be even better to first check whether the network can overfit a single example and only then try a larger batch. Running for three epochs is obviously not enough: the loss is decreasing, but it's not very low, so let's change it to a thousand and rerun. Now the loss is very, very low, so it's overfitting a single example. Increasing the batch size to 64 and rerunning, the loss gets very close to zero, meaning we can overfit a single batch. Now we're confident the network has the capacity and there are no obvious bugs. This is a very quick sanity check of whether the network is actually working, and it will save you a lot of time: every time you implement a network and set up your training, overfit a single batch first. Once that works, remove the single-batch code and bring everything back to the way it was at the beginning.
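A minimal sketch of that single-batch check, assuming the surrounding script already defines model, criterion, optimizer, train_loader, and device (the names are illustrative, not the video's exact code):

```python
# Sketch only: model, criterion, optimizer, train_loader, and device are assumed
# to be defined elsewhere in the training script.
data, targets = next(iter(train_loader))           # grab one fixed batch
data, targets = data.to(device), targets.to(device)

for epoch in range(1000):                          # many passes over the same batch
    scores = model(data)
    loss = criterion(scores, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"loss on the single batch: {loss.item():.6f}")  # should get very close to 0
```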
Number two: forgot to set training or evaluation mode. When you're checking accuracy, you want to toggle the network into evaluation mode. If we call check_accuracy without doing model.eval() inside that function, we get noticeably worse performance. Let's compare: call model.eval(), then check_accuracy(test_loader, model), then model.train() to toggle it back, and train for three epochs. Just by toggling model.eval() we get more than a four percent improvement, which is a big difference. Why does it matter so much? Our model uses dropout, and when we switch to evaluation mode, dropout is turned off and the appropriate weight scaling is applied. When evaluating, we don't want dropout, and the same goes for batch norm: during evaluation we want to use the running averages computed during training. The thing to remember is that whenever you evaluate on test data, call model.eval() first, and afterwards call model.train() so you can continue training. It's a quick one, but it makes a big difference.

Number three: forgot to zero_grad. This one is simple, but it also makes a big difference, and it can be hard to debug; it's a sneaky one you might not notice. Let's remove optimizer.zero_grad(), run it, and see what accuracy we get, keeping model.eval() and model.train() in place as we should. Without optimizer.zero_grad(), after three epochs we get about 64 or 65 percent accuracy. Put it back, rerun, and the test accuracy is roughly 30 percentage points higher, which is huge; if you forget this, you're basically screwed. The reason it matters is that you want the optimizer step to be computed on the current batch. If you don't call optimizer.zero_grad(), you're stepping with the accumulated gradients of all previous batches. You want to zero the gradients so nothing is accumulated, compute the loss for the current batch, and then take a gradient step for that batch: zero_grad, then backward, then step.
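A sketch of both points, using the same illustrative names as above; this check_accuracy is a generic version, not necessarily the exact one in the video:

```python
import torch

# Accuracy check that toggles eval/train mode around the evaluation loop.
def check_accuracy(loader, model, device):
    model.eval()                      # dropout off, batch norm uses running stats
    num_correct, num_samples = 0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            preds = model(x).argmax(dim=1)
            num_correct += (preds == y).sum().item()
            num_samples += y.size(0)
    model.train()                     # toggle back so training behaves normally again
    return num_correct / num_samples

# Per-batch update: gradients are zeroed before every backward pass so the step
# uses only the current batch, never gradients accumulated from earlier batches.
def train_one_batch(data, targets, model, criterion, optimizer):
    loss = criterion(model(data), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```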
Number four: using softmax with cross-entropy loss. A very common mistake is to define self.softmax = nn.Softmax(dim=1) and apply it to the output, because you always see people using softmax as the output layer. The problem is combining that with CrossEntropyLoss, because CrossEntropyLoss is essentially two things: first a softmax, then negative log likelihood. If you also apply softmax to your output, you're effectively doing softmax on top of softmax, which can give you vanishing gradient problems. So don't apply your own softmax if you're using CrossEntropyLoss. To see how big a difference it makes, I'm pasting in a few lines that just give us deterministic behavior (I'll cover that in a separate video) so the comparison is fair. With the extra softmax we get about 92.78 percent, which is pretty good, so it's not as dramatic as the zero_grad mistake, which was around 30 percent. Rerunning without the extra softmax, the difference is about 1.2 percent, and training is also faster without it, so it's a quick change that gives you somewhat better performance.

Number five: using bias when using batch norm. Say we have a very basic convolutional network: a conv layer, a max pool, another conv layer, and a linear layer to the number of classes. We want to add batch norm, so we define self.bn1 = nn.BatchNorm2d(8), since conv1 has 8 output channels, and apply it after conv1. To get a fair comparison, let's also set the seeds for deterministic behavior. Running this we get 98.25 percent. The point is that when batch norm follows a conv layer or a linear layer, that layer's bias term is an unnecessary parameter. It won't cause anything horribly wrong, it's just redundant, so we can set bias=False on the conv layer, and that should be equivalent. Here it came out slightly worse for some reason, 98.22 percent, even though it should be equivalent, so this particular run doesn't show it perfectly. But in general, when batch norm follows a conv or linear layer, you can set that layer's bias to False; you don't need it, because the shift is already included in the batch norm.
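An illustrative model along these lines (the layer sizes are made up, not the video's exact network): the forward pass returns raw logits, since nn.CrossEntropyLoss applies log-softmax and NLL internally, and the conv layers feeding into batch norm drop their bias:

```python
import torch.nn as nn
import torch.nn.functional as F

# Illustrative MNIST-sized CNN: no softmax on the output, bias=False on conv
# layers that are immediately followed by BatchNorm (BN has its own shift).
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(8)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(16)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(16 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))   # 28x28 -> 14x14
        x = self.pool(F.relu(self.bn2(self.conv2(x))))   # 14x14 -> 7x7
        x = x.reshape(x.shape[0], -1)
        return self.fc(x)                                # raw logits, no softmax here

criterion = nn.CrossEntropyLoss()   # expects logits; don't add your own softmax
```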
Number six: using view as permute. The next thing is the difference between view and permute. Let's create a tensor, torch.tensor([[1, 2, 3], [4, 5, 6]]), a two-by-three tensor, and print it. Say you actually want the transpose, so the first column should be 1, 2, 3 and the second 4, 5, 6. You might think view does this, so you call x.view(3, 2) and assume you've permuted the dimensions, but you haven't: it's not the same as x.permute(1, 0), which is the transpose (transpose is a special case of permute). If we run this, we first get the original tensor, then view gives 1, 2 / 3, 4 / 5, 6, which is not the same as permute's 1, 4 / 2, 5 / 3, 6. What view does is simply take the elements in the order they are stored, 1 and 2, then 3 and 4, then 5 and 6, and pack them into the requested shape. I've made another video explaining these two in more detail, but the takeaway is: if you're using view as a way of permuting axes or dimensions, that's probably flawed, and you most likely want permute instead.

Number seven: using bad data augmentation. This is a mistake I've made multiple times, and hopefully I can save you the trouble. Say you're training on the MNIST dataset, you google around, and you see people using cool data augmentation that improves performance, so you do the same: my_transforms = transforms.Compose([...]) with transforms.RandomVerticalFlip and transforms.RandomHorizontalFlip, here both with the probability set to 1.0 so they always flip (normally you'd use something like 0.5, but this makes the point clearer), followed by transforms.ToTensor(). You might use even more, but this is enough to show the point: the augmentation you use has to make sense for your dataset. If we run this, we get a digit that has been flipped both vertically and horizontally, and I have no clue what digit it is anymore; the augmentation has completely changed it. When you do data augmentation, make sure you're not actually changing the target: if you have a nine and you flip it vertically, you've changed what the correct label should be, and a network trained on that will just be horrible. Data augmentation is good, but not all data augmentation is good; make sure it's doing what you actually want it to do.
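A small sketch of both points; the rotation at the end is just one label-preserving augmentation that would be reasonable for digits, not something prescribed in the video:

```python
import torch
from torchvision import transforms

# view just re-reads the underlying elements in memory order; permute actually
# swaps the dimensions, which is what a transpose needs.
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
print(x.view(3, 2))     # [[1, 2], [3, 4], [5, 6]] -- reshaped, not transposed
print(x.permute(1, 0))  # [[1, 4], [2, 5], [3, 6]] -- the actual transpose

# Label-preserving augmentation for MNIST digits (transform and angle are
# illustrative choices, not tuned values).
my_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])
```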
Number eight: not shuffling the data. Another common mistake that can hurt your training is not shuffling the data. It's a bit nuanced, but in most cases you want to shuffle. With the MNIST dataset, for example, we don't want ten batches of only ones followed by ten batches of only twos; we want every batch to be a mix of all the digits. The shuffle argument belongs on the loader, not the dataset, so set shuffle=True on the train loader (and here on the test loader as well). The nuance is that if you're working with time-series data, or anything else where the order matters, you don't want to shuffle, so be careful; but in most cases you do want shuffle=True.

Number nine: not normalizing the data. For this one I've copied in the seeds again so we get deterministic behavior, and we're not using the extra softmax. People often forget to normalize the data, perhaps by skipping the Normalize transform; the MNIST dataset has only one channel, which is why there is only a single value per argument here. You want the data to be centered, with mean zero and standard deviation one. ToTensor divides everything by 255, so the values end up between zero and one, and after ToTensor you want Normalize. For that you first go through the dataset and compute its current mean and standard deviation, then pass those values to transforms.Normalize, one value per channel, so three values for RGB and one for MNIST. Running one epoch without normalization gives 92.24 percent; rerunning with normalization gives 93.04 percent, so about 0.8 percent better just from that one line. It matters less when you're using batch norm, but it's still important, so remember to normalize your data.

Number ten: not clipping gradients when using RNNs, GRUs, or LSTMs. We have a fully connected network here, but pretend it's an LSTM. You'll want gradient clipping, because without it you can get exploding gradient problems; you'd see the error, but it might be hard to debug and take some time to figure out. Go down to the training loop and, right after loss.backward() when the gradients have been computed, call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1). There are a couple of different ways to clip gradients; this is one convenient way. It's just one line, but it can make a big difference and save you some time.

That was ten common mistakes. Let me know in the comments which ones I missed; there are many more, and if you think one is important, write it in the comments and I might do an updated version in the future with some more. Thank you so much for watching the video, and I hope to see you in the next one.
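To tie together the last three points (shuffling, normalization, and gradient clipping), a combined sketch with illustrative names; 0.1307 and 0.3081 are the commonly quoted MNIST mean and standard deviation, and model, criterion, optimizer, and device are assumed to exist as earlier in the script:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# ToTensor scales pixels to [0, 1]; Normalize then centers them using the
# dataset's mean/std; shuffle=True on the loader mixes the classes per batch.
my_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.1307,), std=(0.3081,)),
])
train_dataset = datasets.MNIST(root="dataset/", train=True,
                               transform=my_transforms, download=True)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Gradient clipping goes after backward() and before step().
for data, targets in train_loader:
    data, targets = data.to(device), targets.to(device)
    loss = criterion(model(data), targets)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```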
