Hey, what’s up, guys? This is Akshay from AES Learning. This tutorial is the second one in our gensim playlist, and it covers topic modeling using LDA, so let’s get started. Before starting with the video: it would be really helpful if you subscribed to my channel, it would really help the channel grow. So the main agenda of this video is to build a topic model using LDA with the help of gensim. Okay, so first we will import the gensim library. This is my corpus, and I will be detecting the important topics that are spoken about in this corpus. It is a very small corpus. By the way, what is a corpus? A corpus is a collection of documents, and a document is nothing but a piece of plain text. So understand it this way: if you have some amount of text, that is a document, and a collection of documents is your corpus. My main agenda here is to come up with the topics being spoken about in that corpus; that is called topic modeling, or topic detection. This is very basic topic modeling using LDA, so we won’t be going into fancy vectorization and all; we’ll be using plain bag of words. So this is my corpus, and what I’ve done here is import the sentence tokenizer from NLTK. NLTK is a library for NLP operations — it stands for Natural Language Toolkit — and we’ve imported the sentence tokenizer to break our corpus into separate sentences. Okay, there was a small error there; let me try once again. Yeah, so I have broken my corpus into sentences, and here you can see the list of sentences: one, two, three, four, five, six, seven. So this corpus contains seven sentences, and my agenda here is to build a topic detection model on these seven sentences.
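The sentence-splitting step described above can be sketched like this. The video uses NLTK's `sent_tokenize`; as a dependency-free stand-in, a rough regex splitter behaves similarly on simple prose. The sample corpus below is my own placeholder, not the one from the video:

```python
import re

def split_sentences(text):
    """Rough stand-in for nltk.tokenize.sent_tokenize:
    split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Placeholder corpus with three sentences.
corpus = ("In terms of safety the car model scored well. "
          "The standard model has a strong engine. "
          "Drivers liked the model.")

sentences = split_sentences(corpus)
print(len(sentences))  # 3
```

A real sentence tokenizer also handles abbreviations like "Dr." correctly, which is why the video reaches for NLTK instead of a regex.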
Okay, so the first step when it comes to topic detection: it is always nice to remove words that occur a lot but don’t contribute enough context or meaning to the topics — words like “the” and “of”, so mainly prepositions and articles. These words are repeated a lot, but they don’t contribute to the main topic. Removing these words is called stop word removal. We will be doing stop word removal on the words here, but before that, we will break these sentences down into tokens. What are tokens? Just understand tokenization as breaking a sentence into smaller chunks; here my tokens are word tokens. For example, in this sentence “in” will be my first token and “terms” will be my second token. I’m using gensim.utils.simple_preprocess here. What does this function do? As you can see, it takes each sentence as input and performs gensim’s simple preprocessing on it. Here we have the parameter deacc. What is deacc? It stands for de-accent: in some languages, like German and French, words carry accent marks — diacritics above or below certain characters. By keeping deacc=True, I remove all such marks from my tokens. I have also kept the minimum length as three, so any word or token shorter than three characters gets discarded. So here, as you can see, my first token, “in”, should get discarded because I’ve kept the minimum length at three. This is my list of preprocessed data; let’s see how it appears. Yeah, so in my first sentence, since we kept the minimum length at three,
my first token, “in”, got discarded. Okay, so this is how my corpus appears after gensim’s simple preprocessing. After that, what I will be doing is creating bigrams using gensim’s collocation technique. So what is collocation? With plain bigrams in NLTK, if we have a sentence like “Akshay is teaching NLTK”, it would create bigrams like (Akshay, is), (is, teaching) and (teaching, NLTK) — it considers all the words and pairs each word with its predecessor and successor. But what happens in collocation? It tries to judge whether the predecessor and successor words actually belong together, and it judges that from the counts of those words — their co-occurrence. If two words occur together a lot, they get concatenated into a bigram; if not, those words are kept in the corpus as single words. Okay, so that is my bigram step. I’ll give you an example: in this corpus, since it is very small, we won’t find frequently co-occurring word pairs, so most of the words will be kept single. But if I had a corpus with a lot of occurrences of a pair like “sunny day” — if “sunny day” occurs together a lot — it would be concatenated into “sunny_day” and treated as one bigram token. Okay, so the next step here is lemmatizing and stop word removal. We have already removed some of the words by using the minimum length, but to remove other unwanted prepositions and words that don’t contribute much to the topics, we can use stop words. So here I’ve simply imported the stop words from nltk.corpus, and I am using the lemmatize function of gensim.utils. I’ve created a set of English-language
stop words, and here I’ve written a function. What I’m doing in this function is: in the first statement I remove the stop words; in the second statement I apply the bigram model to the corpus; and in the third statement I lemmatize it. Along with lemmatizing, I’m only keeping the nouns here — NN is the part-of-speech tag for nouns — and I’ve kept my minimum length as five. So what is lemmatization? Lemmatization means bringing a word to its dictionary form. If you have a word like “playing”, it gets lemmatized to its dictionary form, “play”. So what is the benefit of lemmatization? “Playing”, “played” and “player” all give us the hint of “play”, so “play” is what really matters there; that is why we prefer lemmatizing. Okay, so I’ve created the function, and then I call it, and now my corpus is ready. Now what I have to do is create a dictionary, and I’ll be importing LDA. I’m using LDA here, but you can use other models as well: we have LSI, we have RP, we have HDP, so there are many models. But LDA is very popular, and it gives really good results. I’ll show you how my processed training text appears. Yeah, that’s how it appears. Now I’ll create the dictionary. What does a dictionary do? In a normal dictionary, each word has a unique entry, right? So this dictionary is created with one unique entry per token. Let me print it. Yeah, so I have a dictionary of 24 unique tokens — my corpus contains 24 unique tokens. From this dictionary I’ll be creating a corpus, and I have used the doc2bow (bag of words) method for creating the corpus, because my model won’t understand strings, right? It only understands numbers. So how does my corpus appear?
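The cleaning function described above can be sketched roughly like this. Note that `gensim.utils.lemmatize` was removed in gensim 4.0 (it depended on the `pattern` package), so this sketch substitutes a tiny hand-rolled suffix stripper and stop word set purely for illustration — the names and word lists here are my own, not the video's:

```python
# Toy stand-ins: NOT the real NLTK stop list or a real lemmatizer.
STOP_WORDS = {"the", "and", "for", "with", "this", "that"}

def crude_lemmatize(word):
    """Toy suffix stripper standing in for a real lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(tokens, min_len=5):
    """1) drop stop words, 2) lemmatize, 3) keep only long-enough tokens."""
    out = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        lemma = crude_lemmatize(tok)
        if len(lemma) >= min_len:
            out.append(lemma)
    return out

print(clean(["the", "engines", "playing", "car", "standard"]))
# ['engine', 'standard']
```

"playing" is lemmatized to "play" but then dropped by the min_len=5 filter, mirroring how aggressive the length cutoff in the video is.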
I think I printed the corpus here below. Yeah, so this is how my corpus appears. Now, what are these numbers exactly? Try to understand it this way: I have 24 unique tokens, so each word has been assigned a unique id — 0, 1, 2, and as you can see the numbers run from 0 up to 23. And within a particular sentence, how many times does that word occur? That is indicated by the second number: token zero appears one time, token one appears one time, and you will have entries like token five appearing two times, in this way. Now this corpus goes as input to my LDA model. I’m setting the num_topics parameter, and I’m also passing my dictionary to the id2word parameter. I’ll keep the number of topics small — two — because my corpus is very small. Let’s see what topics were detected. So what does LDA do? It tries to map the entire corpus onto two topics, and here, in these two topics, 0 and 1, it has picked up the most contributing words, and these probabilities represent the contribution of each word towards that topic. Okay, so let’s see: we have “model” here, and we also have “seller”, so let’s check our corpus once — do we really have “model” in it? Yeah, as you can see there’s “model”. There’s only one occurrence of “model” there, I guess, but it was contributing a lot, and it even makes sense: this topic seems to be about a car model. There’s “model”, “Ford Tudor”, “standard”; “model” appears one more time here alongside “standard”. Yeah, so it makes sense — the contributing words make sense — and that is how my entire corpus is divided into two topics.
So if you want to see this in a graphical form, you can import pyLDAvis.gensim, enable the notebook mode, and get an entire analysis of this LDA topic detection in the form of a beautiful interactive chart. Yeah, so as you can see, we had two topics here, and you can see which words contribute most to each topic; you can also adjust the relevance parameter and play with the words. Okay, yeah, so that’s it, guys — that was it for this video. If you liked this video, give it a thumbs up and subscribe to my channel, AES Learning, for more such amazing content.