Transcript:
Hello, people from the future. Welcome to Normalized Nerd. In this video, I’m gonna be talking about two very important concepts in NLP: text summarization and keyword extraction. I will be doing these things using a famous library called Gensim, and obviously I’m gonna explain how this thing actually works. If you don’t want to miss any future videos, then please subscribe and hit the bell icon. So let’s get started.
Okay, in this video I’m gonna do things a bit differently. Usually, I first explain the concept and then jump into the code section, but in this video I’m gonna be explaining the concept and the code simultaneously. Let’s see how it works.
Okay, as I told you earlier, I am using Gensim here. This is probably the most famous NLP library. We need two functions, summarize and keywords, and both of them belong to the same module. This is because both of them apply the same algorithm: in one case we consider the sentences, and in the other we consider the words. When I explain the concept, this will become clearer to you.
Okay, so to begin with, we need some text, right? Because if we don’t have any text, what are we gonna summarize, and how can we extract keywords? So I am taking this text. This is just the description of my YouTube channel, nothing fancy about it. I’m taking this whole text as a single string. You can see that this is a long string. After that, we just need to pass two arguments to our summarize function. The first argument, the string, is compulsory, and every other argument is optional. Here I am passing an argument called ratio. The ratio is 0.5, which simply means it will produce a summary that is half the size of our original text. But how do you define the size of a text? Well, there can be two main ways.
The first one is the number of sentences, and the second one is the number of words. It turns out that if you use the ratio argument, it will produce exactly half the number of sentences of the original text. In this text we have six sentences, so this function will produce exactly three sentences. Okay, so the first one is "I love to create educational videos on machine learning and creative coding." True enough. The second one is "Machine learning and data science have changed our world dramatically, and will continue to do so." Absolutely. And the last one is "If you like my videos, please subscribe to my channel." Please do that.
Okay, and one thing you can notice is that it is not creating any text of its own. It is just extracting sentences from our given text. This kind of summarization is called extractive summarization. There is also generative summarization, more commonly called abstractive summarization. In that case, the model comes up with its own sentences. That is a more advanced concept, and it is a topic for another video.
Okay, but how the hell did it pull out these three sentences? Obviously there has to be some logic, right? Well, let’s explore how this algorithm actually works. Don’t be afraid of this huge diagram, because I’m gonna explain each and every step very clearly.
Okay, so first of all, we have our raw text, and if you have followed my NLP series, then you should know that raw text is of no worth. We first have to clean our text. This step is also called text pre-processing. After that, we have our clean text. To get to the clean text we use many techniques: for example, we remove capital letters, we remove punctuation, we remove numbers, and many things like that. Okay, after that, we divide our whole text into sentences. Why? Because, as you have seen, we output the most important sentences in our summary, so we must know the sentences that are present in our text.
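The cleaning and sentence-splitting steps just described can be sketched in a few lines of Python. This is only a rough illustration, not what Gensim does internally: the function names split_sentences and clean are made up for this example, and the sentence splitter is a simple heuristic. One choice to note: the sketch splits into sentences first and cleans each sentence afterwards, because removing punctuation first would erase the sentence boundaries.

```python
import re
import string

def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def clean(sentence):
    """Lowercase a sentence and drop punctuation and digits."""
    drop = str.maketrans("", "", string.punctuation + string.digits)
    # split() + join() collapses the extra spaces left behind by removed characters
    return " ".join(sentence.lower().translate(drop).split())

text = "I love NLP. It has 2 main tasks! Do you agree?"
sentences = split_sentences(text)
cleaned = [clean(s) for s in sentences]
print(cleaned)
```

A real pipeline would usually also remove stop words and handle abbreviations such as "Dr." that this splitter would cut in the wrong place.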
Okay, so we are just splitting our whole text into sentences. After that, we need to tokenize the sentences. Tokenizing simply means we divide each sentence into its words. So this is the first word of the first sentence, and this is the last word of the last sentence.
Okay, so the tokenization is done. Then we need to convert these sentences into vectors. Yes, into numeric values, because computers don’t understand words; they have to be converted into numbers, right? But how can we do that? There are actually many ways. The most effective way is to use word vectors, and I have explained the concept of word vectors in detail in my videos, so I highly recommend you check those out. One thing you will notice is that word vectors are applied to words, right? One word is represented as one vector, but here we are representing whole sentences as vectors. How can we do that? One simple trick is to just append the word vectors. So after appending the word vector of each word in a sentence, we have this long sentence vector. This is the first sentence vector, and this is the last sentence vector. Another way of generating these vectors is to use something called TF-IDF, which I have also explained in a previous video. But it is really not recommended to use TF-IDF if you can use word vectors, because word vectors perform very well compared to TF-IDF.
Okay, after that, we need to create a similarity matrix. Don’t be afraid of it, because the similarity matrix is just a matrix of n rows and n columns. What is n, you may ask? Well, n is simply the number of sentences that we have in our text. Each entry in this matrix is just the similarity between two sentence vectors. For example, the entry at row i and column j denotes the similarity between the i-th sentence and the j-th sentence. Now there can be multiple ways to compute the similarity.
The famous one is the cosine similarity. You can also use the Euclidean distance or any other kind of distance if you want.
Now we need to create a graph out of the similarity matrix. But how can we create a graph, and what will the edges and the nodes be? Well, for text summarization we just want to extract sentences, right? The most important sentences. So the nodes will be the sentences, and the edges will represent the similarity. One thing I should really point out is that the graph shown here is not a complete graph, but if you create the graph from the similarity matrix, it will be a complete graph. That is, every node will be connected to every other node, because we can find the similarity between every two sentences. So this graph really should be a complete graph.
Now comes the most interesting thing. We have to rank these nodes, but how? For this we use something called the PageRank algorithm. This is the billion-dollar web page ranking algorithm that Google uses. Obviously, modern search engines use a much more refined version of this algorithm, but the core idea is the same. This algorithm was developed by Larry Page, and the fun fact is that the algorithm is named after its creator, not after the work it does, that is, ranking pages.
Okay, but how can we use this here? In the original algorithm, they treat the web pages as nodes, the edges are just the links between two web pages, and they define something called a score for each web page. The score simply denotes the probability of a user clicking on the link to that page. In our case, the score will represent the importance of a particular sentence in the text. So after getting these scores we can rank the sentences, and we just need to output the top k sentences. Isn’t that amazing? This whole process is known as the TextRank algorithm. [MUSIC] Let’s get back to coding.
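Before going back to Gensim, here is a minimal sketch of the ranking steps described above: sentence vectors, similarity matrix, graph, PageRank scores, top k sentences. It is only an illustration under stated assumptions and not Gensim's implementation. To stay dependency-free it uses plain word-count vectors instead of word vectors, and the damping factor of 0.85, the 50 iterations, and all function names are choices made for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def similarity_matrix(token_lists):
    """n x n matrix whose (i, j) entry is the similarity of sentences i and j."""
    vectors = [Counter(tokens) for tokens in token_lists]
    n = len(vectors)
    # The diagonal is set to 0 so that a sentence does not vote for itself.
    return [[0.0 if i == j else cosine(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration; edge weights are the similarities."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge between j and i.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores

def top_k(token_lists, k):
    """Indices of the k highest-scoring sentences, returned in original order."""
    scores = pagerank(similarity_matrix(token_lists))
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)

# Already cleaned and tokenized sentences (made-up example data).
sentences = [
    ["machine", "learning", "videos"],
    ["machine", "learning", "changes", "world"],
    ["subscribe", "channel"],
    ["learning", "videos", "channel"],
]
print(top_k(sentences, 2))
```

Passing tokenized words instead of sentences would need a different edge definition (for example, words that occur near each other), which this sketch does not cover.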
Okay, let’s see what other arguments we can pass to our summarize function. Here you can see that we have an argument called split, and if we set this argument to True, it will return a list instead of a string.
Okay, now we are going to use the next function, that is, the keywords function. This will help us extract the keywords. You can see that I am extracting five keywords here, and the keywords are educational, machine, coding, future. Exactly. But how does this keyword extraction work? Well, let’s get back to our conceptual portion. The thing is very simple here. Instead of considering the sentences as nodes, we can just consider the words as nodes. So after running the PageRank algorithm here, we rank the words instead of the sentences, and that’s how we can output the top k most important words. Isn’t it amazing how the same algorithm can be used to generate a summary and extract keywords?
Okay, so that was a very small example, but now let’s do this on a larger example. So here is my large example. This is the last chapter of my most favorite Sherlock Holmes story, The Hound of the Baskervilles, written by Sir Arthur Conan Doyle, and I will definitely provide the link in the description. Okay, so first of all, let me show you the text. It is a big text, as you can see. It starts from here and it is very long. Yeah, it ends here. I have just saved it as hound.txt. So first, I need to read this file. To read this file using Python, I’m first opening it, and make sure to set the encoding to utf-8. Then we are gonna use the standard read function, and we are going to ignore the breaks between lines. Okay, so here I am using the summarize function with a ratio of 0.1, because obviously it is a huge text.
And I don’t want to produce a summary that is too long, so I am using only 10% of it. Okay, so this is the 10% summary. We can also use another argument called word_count, which will just give you a summary containing that many words. So that’s just another way of producing summaries using Gensim.
The last thing I want to show you is the keywords for this text. You can see here that I am producing 30 keywords, and one important thing is that I am using a parameter called lemmatize, set to False. This means it won’t perform lemmatization. But what is lemmatization? Just look at these examples: Stapleton, Stapletons. These two actually come from the same root word. If we don’t use lemmatization, then they will be treated as two different words, but it really shouldn’t be doing that. It should really output only Stapleton. So in the next example, I have set the lemmatize argument to True, so it will perform lemmatization, and it won’t output both words. It will just keep Stapleton. Similarly, in other cases too, it has kept only one word instead of two.
Okay, and if you just go through the words, you will notice that it really works, because Stapleton is an important thing in the story. Baskerville is definitely an important thing in the story. Hound, important. So you can see how good the Gensim keyword extraction is.
So that was all for this video, guys. If you have enjoyed this video, please like this video, share this video, and don’t forget to subscribe to my channel. Stay safe, and thanks for watching. [MUSIC]
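For reference, the Gensim calls mentioned in the video can be collected into one sketch. Some assumptions to keep in mind: the gensim.summarization module exists only in Gensim versions before 4.0 (it was removed in 4.0), so the import is guarded; the file name hound.txt comes from the video, but the helper names load_text and demo, the word_count value of 200, and the way line breaks are replaced with spaces are choices made for this example, since the exact code is not shown in the transcript.

```python
import os

# gensim.summarization was removed in Gensim 4.0, so this needs gensim < 4.0.
try:
    from gensim.summarization import summarize, keywords
except ImportError:  # Gensim is missing or is version 4.0+
    summarize = keywords = None

def load_text(path):
    """Read a UTF-8 text file and replace line breaks with spaces."""
    with open(path, encoding="utf-8") as f:
        return f.read().replace("\n", " ")

def demo(text):
    """Run the summarize and keywords calls described in the video."""
    # Summary limited to about 10% of the original text.
    print(summarize(text, ratio=0.1))
    # Summary limited by word count instead (200 is an arbitrary example value).
    print(summarize(text, word_count=200))
    # split=True returns a list of sentences instead of one string.
    print(summarize(text, ratio=0.1, split=True))
    # Top 30 keywords, without and with lemmatization.
    print(keywords(text, words=30, lemmatize=False))
    print(keywords(text, words=30, lemmatize=True))

if __name__ == "__main__":
    if summarize is None:
        print("gensim.summarization is unavailable (needs gensim < 4.0)")
    elif not os.path.exists("hound.txt"):
        print("hound.txt not found")
    else:
        demo(load_text("hound.txt"))
```

With a newer Gensim installed, the script only prints a message; the summarization calls themselves need an older release.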