Transcript:
[MUSIC] What's going on, guys? I hope you're doing awesome, and welcome back for another PyTorch video. I'm guessing most of you already know what torchtext is about, since you probably searched for it and clicked on the video, but for those of you who don't, I'd better explain what it is. For NLP tasks there's a lot of preprocessing needed when dealing with text data, and torchtext is a powerful library that can handle a lot of that preprocessing for us. So one generic question is: what kind of preprocessing do we want a library to do for us? Here we have a generic list, and we can see that torchtext checks a lot of these boxes, like file loading, tokenizing, creating the vocabulary, and batching. The parts it doesn't handle are splitting the data and the embedding lookup, which I'm going to show, probably not in this video but in another one. In this video we'll just focus on the functionality of torchtext.

Here's a quick overview of what we want. Say we have the text "The quick fox jumped over the lazy dog." We would first tokenize it, so it becomes a list ["the", "quick", "fox", ...]. Then we create a vocabulary, where we map each word to an index. Once we have the vocabulary, we take the example again and numericalize it, so we get a vector of vocabulary indices, one for each word. The final step would be an embedding lookup, where each word gets a d-dimensional embedding vector that represents it, and these could also be pretrained embeddings like GloVe vectors. So that's the overview, and I think we're ready to go into the code now.

Let's first lay out the steps we're going to take, so that we're clear on them. The first thing we want to do is specify how the preprocessing should be done, and this is done with fields. The next step is to use a dataset to load the data, and I guess this would also do the numericalizing we just looked at. This is done with TabularDataset in torchtext, which handles a couple of different dataset file types: JSON, CSV and TSV files, and we're going to show all of those in this video. The last step is to construct an iterator to do the batching and padding, and that's done with BucketIterator. These three are what we're going to use, so we can import them: from torchtext.data, import Field, TabularDataset and also BucketIterator.
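For reference, here is what those imports look like. One version note: this is the classic torchtext API; from torchtext 0.9 these classes moved to torchtext.legacy.data, and they were removed in later releases, so you may need to pin an older torchtext for this to run as written.

```python
# Classic torchtext API (torchtext.legacy.data on 0.9.x, removed in newer versions).
from torchtext.data import Field, TabularDataset, BucketIterator

# The plan for this video:
# 1. Field          - specify how each column should be preprocessed
# 2. TabularDataset - load JSON/CSV/TSV files and numericalize the text
# 3. BucketIterator - batch the examples and pad them to equal length
```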
Before we continue, I feel it's important to also show you how the data should be structured. Here I have some train and test data, and it would also work if you had validation data in another file. We have all the formats here: a CSV file, a JSON file and a TSV file. Let's open the JSON file. What I have here is just some very simple toy data, three examples, and the idea is that we have some quote and we want to say how motivational that quote is. So we have "Jocko said: you must own everything in your world, there is no one else to blame," and we rate this a score of 1 because it's highly motivational. Then we have something else, like "Random Potato said: stand tall and rise like a potato," and this gets a score of 0. When you have data, some fields may not really be relevant; in this case the name isn't relevant, but the quote and the score are. So this is how it looks if it's a JSON file: you have a dictionary for each row in the file. If it's a CSV file, it's what you'd expect: name, quote and score columns with the values under them, and the same for the TSV file, except it's separated by tabs.

Now that we have imported the classes we're going to use, let's start with the tokenizer. This can be done in several different ways. You probably wouldn't want to do it this way, and I'll show you how to do it better later in the video, but a simple way is tokenize = lambda x: x.split(), which just splits the text wherever there's a space. Then we create the quote field: we call Field with sequential=True, because the data is sequential, use_vocab=True, tokenize=tokenize, so the function we just defined, and lower=True, which makes everything lowercase. So here we specify how the data should be preprocessed; that's the first step: it's going to be lowercase and use this tokenizer. Then we have the score. It's also going to be a Field, but with sequential=False and use_vocab=False. I guess this is an example of sentiment analysis, but the things we go through are really generic, so sequential could for example be True if this were a translation dataset or something like that.

Then we construct fields as a dictionary, where we specify which columns to use in the dataset. Remember we had a name column, but it wasn't really relevant; if we don't mention name here, it's just going to be ignored. So we use quote and score, the two that we feel are important. We map "quote" to ("q", quote), which specifies that this column should be processed using this field, and "score" to ("s", score). What the "q" is for is that later on, when we create the batches, we get the quote with batch.q, and to get the score we do batch.s. I guess this makes things more compact; we could also just use "quote" and "score" again, but this shows that you can change it.
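Here's a sketch of the fields setup described above, assuming toy data files with name, quote and score columns like the ones in the video:

```python
# train.json / test.json contain one JSON object per line, e.g.:
# {"name": "Jocko", "quote": "You must own everything in your world. There is no one else to blame.", "score": 1}

# Naive whitespace tokenizer (replaced with spaCy later in the video).
tokenize = lambda x: x.split()

# quote: sequential text -> tokenize it, lowercase it, build a vocabulary for it.
quote = Field(sequential=True, use_vocab=True, tokenize=tokenize, lower=True)

# score: a single number -> not sequential, no vocabulary needed.
score = Field(sequential=False, use_vocab=False)

# Map each column name to (attribute name on the batch, field).
# The "name" column is ignored simply because it isn't listed here.
fields = {"quote": ("q", quote), "score": ("s", score)}
```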
After that, we want to create the dataset, so we use TabularDataset.splits. We pass path="myData", which is the folder I have the data in, then train="train.json" and test="test.json", format="json", and fields=fields, and this returns a tuple of the train data and the test data. If you wanted to use the CSV files instead, you would essentially just copy this and change "json" to "csv", and for TSV it's the same, just change "csv" to "tsv". One thing I want to add here is that we don't have a validation set, but if you did, you could also pass validation="validation.json", for example. So that's how you would use it for JSON, CSV and TSV; let's comment the other two out and just use the JSON file.

We can print a single example from the train data and look at its dict of keys and values, which should give us the quote and the score. Let's see... "torchtext is not a package". Oh man, that error took a very long time to track down. The problem was that when I created my file, I named it torchtext, which was a very, very bad name since the package is also called torchtext, so of course it caused a problem. I just renamed it to something else. If we now run this, we can see that we get the keys, q and s, for the quote and the score, and we also get the tokenized quote and the score of that quote.

That's a good start, but what we want to do now is build a vocabulary, and we're going to build it for the quote field by calling build_vocab on the training data. We can also pass something like max_size=10000 for the maximum vocabulary size; of course in this case we only have about 50 words in total, but it's just for illustration. We could also set something like min_freq=2, which would only include words that appear at least twice in our dataset, but we'll set it to 1 in our case.

The next step of our plan is to construct an iterator to do the batching and the padding. We get train_iterator and test_iterator from BucketIterator.splits, passing the tuple (train_data, test_data), a batch_size of, let's say, 2, and device set to CUDA. This just splits our training data and test data into iterators, so nothing difficult here. What it's also going to do is the padding for us, which we're going to see soon. If we do for batch in train_iterator and print batch.q, we'll see a batch with two examples and then a last batch with one example, since we only have three in total. One thing to notice here is that they are all of equal length within a batch, and the 1s stand for the pad token. That's what it does for us; that's the power of torchtext: it handles the padding and so on for us. We can also print batch.s, which gives us the score for each quote. So this is a good step for us; this is what we want.
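Putting the loading, vocabulary and iterator steps together, a minimal sketch might look like this. The myData folder and file names follow the video's setup, and I've added a CPU fallback for the device, where the video hardcodes CUDA:

```python
import torch

# Returns (train_data, test_data); add validation="validation.json" if you have one.
# For the CSV/TSV files, change the file names and use format="csv" / format="tsv".
train_data, test_data = TabularDataset.splits(
    path="myData",
    train="train.json",
    test="test.json",
    format="json",
    fields=fields,
)

print(train_data[0].__dict__.keys())    # dict_keys(['q', 's'])
print(train_data[0].__dict__.values())  # tokenized quote + score

# Build the vocabulary from the training data only.
quote.build_vocab(train_data, max_size=10000, min_freq=1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Batch the examples and pad each batch to equal length.
train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data), batch_size=2, device=device
)

for batch in train_iterator:
    print(batch.q)  # LongTensor of vocabulary indices; 1 = <pad>
    print(batch.s)  # the score for each quote
```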
We can make a few minor improvements on this, which I'm going to show now. Our goal right now is to replace this tokenize function; as I said in the beginning, it's a bad tokenizer, and we want a better one. So what we're going to do is import spacy, which you can get from pip install spacy, and then one more thing: for English, we're going to load the English language model, which you can download through spaCy. Then we define tokenize(text) to return an array of tok.text for each tok in the spaCy tokenizer applied to that text, and we can just remove the old one. This is a better tokenizer, and especially for languages other than English it's going to be a lot better; if you have another language, you would just load that language's model instead.

There's one more thing I want to do: you can also get pretrained word embeddings using torchtext. In build_vocab you can pass vectors="glove.6B.100d". These are pretrained GloVe vectors trained on a dataset of, I think, 6 billion tokens, in 100 dimensions. Before you run this, be aware that the download is around one gigabyte, so it's going to take a little bit of time if you have a slow internet connection. You would actually need to do one more thing for this: you would need to transfer those weights onto the embedding layer defined in your network. I'm not going to cover that in this tutorial, but I'll have code for it on my GitHub; it's literally just two lines you have to add after you've created the model. But then we would have to create the network and all of that, and I don't really feel that's the point of this video, so check it out on GitHub if you want. I'll put a rough sketch of both of these changes below the transcript.

So for this video, we covered how to use torchtext to do the preprocessing when we have a custom dataset. In the next video I'll show the built-in datasets that torchtext has to offer and how we can load them. Thank you so much for watching this video, and I hope to see you in the next one. [MUSIC]
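As promised, here's a rough sketch of the two improvements from the end of the video: the spaCy tokenizer and the pretrained GloVe vectors. The model name en_core_web_sm is the current spaCy convention; older spaCy versions used the shortcut "en" instead. The commented weight-copying lines at the end are the commonly used pattern, shown here as an assumption since the video defers them to GitHub:

```python
import spacy
from torchtext.data import Field

# Requires: pip install spacy && python -m spacy download en_core_web_sm
spacy_en = spacy.load("en_core_web_sm")

def tokenize(text):
    # spaCy's rule-based tokenizer handles punctuation, contractions, etc.
    # far better than a plain str.split().
    return [tok.text for tok in spacy_en.tokenizer(text)]

quote = Field(sequential=True, use_vocab=True, tokenize=tokenize, lower=True)

# Attach pretrained GloVe vectors; roughly a 1 GB download the first time.
quote.build_vocab(train_data, max_size=10000, min_freq=1,
                  vectors="glove.6B.100d")

# Later, after building the model, copy the vectors into its embedding layer
# (the "two lines" mentioned in the video; assumes model.embedding exists):
# pretrained_embeddings = quote.vocab.vectors
# model.embedding.weight.data.copy_(pretrained_embeddings)
```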