Transcript:
Ladies and gentlemen, welcome back for another PyTorch video. In this video I want to show you how to code a simple RNN, as well as how to code a GRU or an LSTM, in PyTorch. All I have here is the code for a fully connected neural network that we coded in a previous video, but I'll recap it quickly if you haven't watched that one. All we have is a very simple fully connected net that's training on the MNIST dataset: we load MNIST, we initialize the network and the optimizer, we have a training loop, and at the end we check how accurate our model is. That's really all there is, and I didn't want to repeat the code for all of that, so check out the previous video if you want to see more of it. In this video I'll just focus on the RNN.

First of all, we want to change our hyperparameters, and we can remove this right here since we're creating an RNN. When we load the MNIST dataset, the shape is going to be N by 1 by 28 by 28, where N is the batch size, say 64. What we can view this as is that we have 28 time steps, and each time step has 28 features; that's how we can view the RNN working in this case. I also want to add that normally you wouldn't use an RNN for images, but we just want to learn how to create an RNN, so we can use it here. So the input size should be 28 and the sequence length is 28; we're essentially taking one row of the image at a time, and that's what we send into the RNN at each time step. Then we're going to have a number of layers for our RNN, let's say two, and let's say we want the hidden size to be 256 nodes. The learning rate stays the same, and let's say the number of epochs is two. That's really all we want for the hyperparameters, and you're going to see why we need them.

So let's do class RNN, inheriting from nn.Module. We're going to have our init function, and what we send in here is, first of all, the input size, the hidden size, the number of layers, and also the number of classes. First thing, we call super on RNN and self and run init. What we're going to start with now is just a very basic RNN, and then we'll take it to the GRU and the LSTM. First we set self.hidden_size to hidden_size and self.num_layers to num_layers, and then we define self.rnn, which will be nn.RNN. The input size is going to be input_size, and that's the number of features at each time step. We don't have to explicitly say how many time steps we want; the RNN will work for any number of time steps that we send in, in this case 28. Then we pass hidden_size, which is the number of nodes in the hidden state at each time step, and lastly the number of layers for the RNN. One additional argument we're going to use is batch_first=True, since the MNIST data we load has the batch as the first axis, so we need to set batch_first=True.
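As a reference, here is a minimal sketch of the hyperparameters described above; the exact variable names, and the learning rate carried over from the previous video, are assumptions:

```python
# Hyperparameters (values as described above; learning_rate is assumed from the previous video)
input_size = 28        # features per time step (one image row)
sequence_length = 28   # number of time steps (image rows)
num_layers = 2         # stacked RNN layers
hidden_size = 256      # nodes in the hidden state
num_classes = 10       # MNIST digits 0-9
learning_rate = 0.001  # assumed
batch_size = 64
num_epochs = 2
```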
Yeah, you can read more in the documentation about how nn.RNN expects the input to be shaped, but with batch_first=True, as in this case, the input needs to have the batch size first, then the time dimension, and then the features. That's what we're going to send in here. Then we also have a fully connected layer at the end, so we do nn.Linear, and what we pass here is hidden_size times sequence_length, and then num_classes. As I said, we have 28 time steps, and what we're going to do is concatenate the hidden states from all of those time steps, and that's what we send into the linear layer, so it uses information from every hidden state. You could also just take the absolute last hidden state, and I'm going to show you at the end of this video how to do that as well, but let's start with this one. So now we're done with the initialization: that's the RNN and that's the linear layer.

Then we define forward(self, x). We need to initialize the hidden state first, so we do h0 = torch.zeros, and the hidden state needs to be shaped with the number of layers first, then x.size(0), which is how many examples we send in at the same time in the mini-batch, then self.hidden_size, and then we do .to(device). Then, for the forward prop, we call self.rnn and send in x and the hidden state, and we get out plus a second output, which is just the new hidden state. Since we're not going to store the hidden state, because every example has its own hidden state, we just ignore that second output. Then we do out = out.reshape, keeping the batch as the first axis and concatenating everything else, which is the sequence length, 28, times the hidden size, 256. Then we do out = self.fc(out), so we just pass it through the linear layer, and then return out, and that should be it.

Let's see, we need to construct the RNN here and send in all of these things, so let's change this line so we send in the input size, the hidden size, the number of layers, and the number of classes, which we defined in the hyperparameters. The rest of the code should not change, so we should be able to run this now. And we cannot, so let's see what's wrong: the error says the input must have three dimensions. Yeah, I know what's wrong here. As I said, the MNIST data has shape N by 1 by 28 by 28, but the RNN expects N by 28 by 28, so we have to do .squeeze(1), which removes the 1 on that particular axis, axis 1, and hopefully it should work now. All right, we also can't have this line here; it's left over from the previous fully connected network, so that needs to be removed. (See the sketch below for how the pieces fit together.)
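Putting the pieces together, here is a sketch of the model as described, under the assumption that device, sequence_length, and the other hyperparameters are module-level variables defined earlier in the script:

```python
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # batch_first=True -> input is shaped (batch, sequence, features)
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        # Hidden states from all 28 time steps are concatenated before the classifier
        self.fc = nn.Linear(hidden_size * sequence_length, num_classes)

    def forward(self, x):
        # Initial hidden state: (num_layers, batch, hidden_size)
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)
        # The second output is the final hidden state; we don't need it
        out, _ = self.rnn(x, h0)
        # Flatten (batch, 28, 256) -> (batch, 28 * 256)
        out = out.reshape(out.shape[0], -1)
        out = self.fc(out)
        return out

model = RNN(input_size, hidden_size, num_layers, num_classes).to(device)
```

In the training loop (and, as mentioned below, in the accuracy check), the MNIST batch of shape (N, 1, 28, 28) is turned into (N, 28, 28) with something like data = data.to(device).squeeze(1).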
I don't think there should be anything else now, so I'm going to let it train and I'll get back to you when it's done. All right, it's done training, and we get about 97.5 percent accuracy on the training set and 97.28 on the test set, which is actually quite good, right? We just trained it for two epochs, and it's just a basic RNN. One thing I forgot to mention is that we need to do the same .squeeze(1) in the check-accuracy function, but that's just a detail.

Now let's see if we can improve on this result by changing it to a GRU instead. What we can do is use nn.GRU instead of the basic nn.RNN, and we really don't have to change anything else, except renaming it to self.gru. That should be all we have to change, so I'll rerun this and we'll see what we get. After letting it train, we see that we got a little bit of an improvement: 98.41 on the training set and 98.10 on the test set.

Now let's change this to an LSTM instead. What we need to do then is use nn.LSTM, so self.lstm, and now we actually need a separate cell state, so we do c0 = torch.zeros with self.num_layers and the same remaining shape as the hidden state. If you remember, the LSTM has both a hidden state and a cell state; that's not the case for a GRU or a basic RNN, but for an LSTM we need to define a separate one, just like the hidden state. Then, when we call self.lstm, we send in h0 and c0, so the hidden state and the cell state, as a tuple in the second argument, and that's really all we need to change. I'm going to run this again and see what we get. All right, we get comparable results to the GRU; in this case the GRU is actually outperforming the LSTM. I guess in practice you most commonly see the LSTM performing better, but really they are comparable, and neither is strictly better than the other. I think using an LSTM is a good default choice.

But let's see what I want to do now. I said that right now we're using information from every hidden state, but perhaps just using the last hidden state is okay, right, because the last hidden state has information from all of the previous ones. What we can do for that is remove the sequence_length factor from the linear layer, so we don't need the concatenation of all the hidden states; we're just taking the last one. We remove the reshape, and instead we index out so that it takes all training examples in the mini-batch, then just the last hidden state, and then all of the features. That's really all we need to change for it to take a specific hidden state, in this case the last one. Of course, just thinking about it, we're losing information by doing this, so the result is probably going to be worse, but perhaps in some cases just taking the most relevant information and training on that is better than taking all the information. Let's see what we get. All right, it seems that I lied.
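For reference, here is a sketch of the LSTM variant together with the last-hidden-state change described above; the class name LSTMNet is hypothetical (in the video the existing RNN class is edited in place), and device is assumed to be defined earlier:

```python
class LSTMNet(nn.Module):  # hypothetical name; the video modifies the same class in place
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(LSTMNet, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        # Only the last hidden state feeds the classifier, so no sequence_length factor here
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # The LSTM needs both a hidden state and a cell state, passed together as a tuple
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)
        out, _ = self.lstm(x, (h0, c0))
        # All examples in the batch, only the last time step, all features
        out = self.fc(out[:, -1, :])
        return out
```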
I'm not really sure how it's becoming better, but it seems that it's performing better now when just using the last hidden state. I really think that's just a matter of training longer, but that doesn't matter too much. Anyway, that's how you would use just the last hidden state, and that's all for RNNs, GRUs, and LSTMs. In the next video I'll show how to do a bidirectional LSTM. If you have any questions, leave them below. Thank you so much for watching, and I hope to see you in the next video.