Transcript:

Hi, everybody! Welcome back to a new PyTorch tutorial. Today I want to show you the PyTorch Dataset and DataLoader classes. So far, our code looked something like this: we had a data set that we loaded somehow, for example from a CSV file, and then we had our training loop that looped over the number of epochs, and we optimized our model based on the whole data set. This might be very time-consuming if we did gradient calculations on the whole training data, so a better way for large data sets is to divide the samples into so-called batches. Then our training loop looks something like this: we loop over the epochs again, and then we do another loop over all the batches, and then we get the x and y batch samples and do the optimization based only on those batches. Now if we use the built-in Dataset and DataLoader classes from PyTorch, then PyTorch can do the batch calculations and iterations for us. So it’s very easy to use, and now I want to show you how we can use these classes.
But before we jump to the code, let’s quickly talk about some terms when we talk about batch training. First, one epoch means one complete forward and backward pass of all the training samples. The batch size is the number of training samples in one forward and backward pass, and the number of iterations is the number of passes, where each pass uses batch-size number of samples. So here we have an example: if we have 100 samples and our batch size is 20, then we have 5 iterations for one epoch, because 100 divided by 20 is 5. So yeah, that’s what we should know, and now let’s jump to the code.
First, I already imported some modules that we need: torch, of course, then also torchvision, and then from torch.utils.data we import Dataset and DataLoader, so the classes I just talked about.
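As a quick check of the batch arithmetic described above, here is a minimal sketch in plain Python. The helper names iterations_per_epoch and batch_slices are made up for this illustration and are not part of PyTorch; rounding up with math.ceil is an assumption that matters only when the sample count is not a multiple of the batch size.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once.

    Rounds up, so a final, smaller batch still counts as one iteration.
    """
    return math.ceil(n_samples / batch_size)


def batch_slices(n_samples, batch_size):
    """Yield (start, stop) index pairs, one pair per batch."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


# The example from the talk: 100 samples with a batch size of 20.
print(iterations_per_epoch(100, 20))
print(list(batch_slices(100, 20)))
```

This is roughly the bookkeeping that a data loader takes over, so the training loop only has to iterate over ready-made batches.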
Let’s also import numpy and math, and now we can start implementing our own custom data set. Let’s call this WineDataset, and this must inherit Dataset, and then we have to implement three things. We have to implement the __init__ with self, so here we do some data loading, for example. Then we also must implement the __getitem__ method, which gets self and an index, so this will allow for indexing later, so we can call dataset with an index 0, for example. And then we also must implement the __len__ method, which only has self, and this will allow us to call len on our data set.
So now let’s start. In our case, we want to look at the wine data set. I have the CSV file here, and I also put this in my GitHub repository, so you can check that out. The data set looks like this: the first row is the header, and here we want to predict the wine category. We have three different wine categories, 1, 2, and 3, and the class label is in the very first column, and all the other columns are the features.
So let’s load this and split our columns into x and y. Here we can say xy equals numpy.loadtxt, and here I must specify the file name, so this is in the data folder, and then I have a wine folder, and then it’s called wine.csv. Then let’s also give a delimiter equals a comma here, because this is a comma-separated file. Then let’s also give it a data type, so let’s say dtype equals numpy.float32, and let’s also say skiprows equals 1, so we want to skip the first row because this is our header.
Now let’s split our whole data set into x and y. We say self.x equals, and here we can use slicing, so xy, and we want to have all the samples, and then we don’t want the very first column, so we start at column number 1 and go all the way to the end. This will give us the x. And then self.y equals xy of, and here again we want all the samples, but only the very first column, and we put this in another array here so that we have the size number of samples by 1. This will make it easier later for some calculations. And let’s also convert this to a tensor, so we can say torch.from_numpy and pass this to the function here. You do not need this, we could also convert it later, but we can do it right here, so let’s do this. Let’s also get the number of samples, so let’s say self.n_samples equals xy.shape and then 0, so the first dimension is the number of samples. Then we can return this right here, and this is our whole __len__ function, so return self.n_samples. And here in __getitem__ we can also implement this in one line: we can say return self.x of this index and then self.y of this index, so this will return a tuple. And yeah, now we are done. This is our data set that we just implemented.
Now let’s create this data set, so let’s say dataset equals WineDataset(), and now let’s have a look at it. We can say first_data equals dataset, and now we can use indexing, so let’s have a look at the very first sample, and let’s unpack this into features and labels like this. So this is first_data, and now let’s print the features and also print the labels to see if this is working. And yeah, so we have only one row of features, so this is one row vector, and then the label, so the label 1 in this case.
So this is how we get the data set, and now let’s see how we use a DataLoader. We can say dataloader equals the built-in DataLoader class, and then we say dataset equals this dataset, and then we can also give this a batch size, so batch_size equals, let’s say, 4 in this case. Then let’s say shuffle equals True, which is very useful for training, so this will shuffle the data, and then we also say num_workers equals 2.
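The steps described so far can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the video: the csv_path constructor argument and the write_toy_csv helper are hypothetical, and a small random CSV in a temporary folder stands in for the real wine file so the sketch runs on its own. It also uses num_workers=0 instead of 2 to keep the example portable.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class WineDataset(Dataset):
    def __init__(self, csv_path):
        # csv_path is a hypothetical parameter; the talk hard-codes the file location.
        xy = np.loadtxt(csv_path, delimiter=",", dtype=np.float32, skiprows=1)
        self.x = torch.from_numpy(xy[:, 1:])   # every column except the first: features
        self.y = torch.from_numpy(xy[:, [0]])  # first column, kept 2-D: (n_samples, 1)
        self.n_samples = xy.shape[0]

    def __getitem__(self, index):
        # Enables dataset[index]; returns a (features, label) tuple.
        return self.x[index], self.y[index]

    def __len__(self):
        # Enables len(dataset).
        return self.n_samples


def write_toy_csv(path, n_rows=10, n_features=13, seed=0):
    """Hypothetical helper: header row, label in column 0, random features after it."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(1, 4, size=(n_rows, 1))  # class labels 1, 2, or 3
    features = rng.random((n_rows, n_features))
    header = ",".join(["label"] + [f"f{i}" for i in range(n_features)])
    np.savetxt(path, np.hstack([labels, features]), delimiter=",",
               header=header, comments="")


csv_path = os.path.join(tempfile.mkdtemp(), "toy_wine.csv")
write_toy_csv(csv_path)

dataset = WineDataset(csv_path)
features, labels = dataset[0]  # first sample, unpacked into features and label
print(features.shape, labels.shape)

# num_workers=0 loads in the main process; the talk uses 2 worker subprocesses.
dataloader = DataLoader(dataset=dataset, batch_size=4, shuffle=True, num_workers=0)
```

With the real file, the path argument would point at the wine CSV instead of the toy file.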
So you don’t need to set the number of workers, but this might make loading faster because it’s using multiple subprocesses. Now let’s see how we can use this DataLoader object. We can convert this to an iterator, so let’s say dataiter equals iter(dataloader), and then we can call the next function, so we can say data equals dataiter.next(), and then we can again unpack this by saying features, labels equals data. Now let’s print the features and the labels to see if this is working. And yeah, here we have it. In this case, I specified the batch size to be 4, which is why we see four different feature vectors here, and then also for each feature vector the class, so four class labels in our labels vector, or labels tensor.
We can also iterate over the whole DataLoader, and not only get the next item. So now let’s do a dummy training loop. Let’s specify some hyperparameters, so let’s say num_epochs equals 2, and then let’s get the total number of samples, so total_samples equals len of our dataset. Now let’s get the number of iterations in one epoch. This is the total number of samples divided by the batch size, so divided by 4, and then we also have to ceil this, so math.ceil of this. Now let’s print our total samples and the number of iterations, and then we see we have 178 samples and 45 iterations.
So now let’s do our loop. Let’s say for epoch in range(num_epochs), and now we do the second loop and loop over the train loader. Let’s say for i, and here we can already unpack this by saying (inputs, labels), in enumerate, and here we only put in the, how did we call it, dataloader. This is all we have to do, and now this enumerate function will give us the index and then also the inputs and the labels here, which are already unpacked.
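A minimal, self-contained sketch of the iteration steps just described is shown below. It assumes a synthetic TensorDataset of 178 random samples with 13 features as a stand-in for the wine data, so the values are meaningless and only the shapes and counts matter. It also calls the built-in next(dataiter) rather than the dataiter.next() method, since that method may not be available in recent PyTorch releases.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the wine data: 178 samples, 13 features, labels 1 to 3.
features = torch.rand(178, 13)
labels = torch.randint(1, 4, (178, 1)).float()
dataset = TensorDataset(features, labels)

batch_size = 4
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Fetch a single batch through an explicit iterator.
dataiter = iter(dataloader)
batch_features, batch_labels = next(dataiter)
print(batch_features.shape, batch_labels.shape)

# Number of iterations per epoch, rounded up for the smaller last batch.
total_samples = len(dataset)
n_iterations = math.ceil(total_samples / batch_size)
print(total_samples, n_iterations)

# Dummy training loop: enumerate yields the batch index and the unpacked batch.
num_epochs = 2
batch_shapes = []
for epoch in range(num_epochs):
    for i, (inputs, targets) in enumerate(dataloader):
        # A real loop would run the forward pass, backward pass, and weight update here.
        batch_shapes.append(tuple(inputs.shape))
```

Because 178 is not a multiple of 4, the last batch of each epoch holds only 2 samples; passing drop_last=True to DataLoader would skip it instead.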
Now what we should typically do in our training is our forward pass and then our backward pass, and then update our weights. This is just a dummy example, so in this case I only want to print some information about the batch that we have here. Let’s say if (i + 1) modulo 5 equals equals 0, so every fifth step we want to print some information. Let’s print epoch, and here let’s print the current epoch and then all epochs, so here let’s say num_epochs. Then let’s also print the current step, so step, and this is i + 1, and then the total steps, so this is n_iterations here. And let’s also print some information about our input, so inputs, and let’s say here we want to print inputs.shape only. Now let’s run this to see if this is working. And yeah, here we see our print statements. We see that we have two epochs, and in every epoch we have 45 steps, and every fifth step we print some information. We also see that our tensor is 4 by 13, so our batch size is 4, and then we have 13 different features in each batch. So that’s how we use the Dataset and DataLoader classes, and then we can very easily get the single batches.
And, of course, PyTorch also has some already built-in data sets. For example, from torchvision.datasets.MNIST we get the famous MNIST data set, and we can also get the Fashion-MNIST data set, or the CIFAR data set, or the COCO data set. The MNIST data set is one that we will use in one of the next tutorials. For now, this is what I wanted to show you about the Dataset and DataLoader classes. I hope you liked it. Please subscribe to the channel, and see you next time. Bye.