Hello. I'm Doctor Ileana Moran. Today we're going to discuss two different algorithms that are somewhat comparable in natural language processing. These are TF-IDF, that's term frequency times inverse document frequency, and Word2Vec and Doc2Vec, variants on the same theme that are more recent algorithms and typically useful for larger corpora. The reason we're having this little discussion, which is sort of a bonus add-on to what I plan to do in terms of going through the mathematics of the different methods, is that students will typically run both algorithms on the same corpus, which is the right thing to do. They'll report the results and say "this one worked better for me than the other," which is also the right thing to do. But they won't always have a sense of intuition; they can't always articulate why a certain algorithm performs better for them. So in this very short little vid, we're going to drill down just a little bit into the whys and wherefores of how running an algorithm on a corpus of your choosing might show one algorithm to be more suited to your needs than another. Let's jump to the slide deck, shall we?

This video is part of a series on natural language processing. In this vid we address text vectorization; specifically, we're going to compare TF-IDF versus the Word2Vec and Doc2Vec algorithms. In other vids we've already taken a look at the overview, and in upcoming, yet-to-be-produced vids we're going to take a look at clustering, topic mapping, and other things. Just for quick context: we're looking at doing text vectorization after we've already extracted the terms from a set of documents. Our goal now is to produce a reference term vector that characterizes the important terms within that corpus, or collection of documents. As I mentioned, there are two general classes of algorithms that we're looking at. The first of these, term frequency times
inverse document frequency, is a very well-known algorithm; it's pre-2000. For the second of these, we're really just going to focus on Doc2Vec and leave Word2Vec alone. The difference is that Word2Vec works on a single word within a document, while Doc2Vec works on the whole set of words within a single document. The Word2Vec and Doc2Vec algorithms are much more recent than TF-IDF: they were invented in 2013 by Mikolov and his team at Google, and naturally, being invented at Google, they were designed to handle Google-sized corpora.

So let's assume that for every document in our corpus we've already done the necessary prerequisites: we've run the various term extraction algorithms, and we have produced a preliminary bag of words for each document. Let's think now about just a single document within our corpus, and let's say that it has a total of Z terms in it. These are nouns and noun phrases, most often. As part of producing this bag of words, we'll have the term frequency (TF) for each term within that document. Suppose that we're going to use the TF-IDF algorithm next. That means that we're going to compute the IDF, or inverse document frequency, for every term in the corpus. We won't worry about the mathematics of the IDF right now; let's just think of it as a mathematical function that's computed individually for each term within our corpus. By multiplying the TF for each term by its appropriate IDF, we get the TF-IDF. Then what we do is take that set of TF-IDF values and identify a cut-off point: those values that are above it we keep, and those below we simply eliminate. This gives us a resulting reduced-dimensionality vector. The important thing for us to keep in mind about TF-IDF is that it's a one-to-one mapping: a single term in the initial vector gives us a single term in the resulting TF-IDF vector. By way of comparison, let's look at the Doc2Vec method. As I mentioned earlier, it was developed
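The TF-IDF pipeline just described, per-document term frequencies, a corpus-wide IDF, a multiplication, then a value cut-off, can be sketched in a few lines of Python. This is a minimal illustration, not the exact formula from the slides: the smoothed IDF variant and the cut-off value of 0.2 are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs, cutoff=0.1):
    """Compute per-document TF-IDF and keep only terms above a cut-off."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Smoothed inverse document frequency for every term in the corpus.
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in df}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # TF-IDF = (term frequency in this doc) x (corpus-wide IDF).
        scores = {t: (c / total) * idf[t] for t, c in counts.items()}
        # Value cut-off: keep terms at or above the threshold, drop the rest.
        vectors.append({t: s for t, s in scores.items() if s >= cutoff})
    return vectors

docs = [
    ["term", "extraction", "term", "frequency"],
    ["inverse", "document", "frequency"],
    ["document", "vector", "term"],
]
vecs = tf_idf_vectors(docs, cutoff=0.2)
```

Note the one-to-one mapping: each surviving entry in a vector is keyed by the very term it came from, which is exactly the TF-IDF property discussed above.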
by Mikolov and colleagues at Google, and it was designed to deal with very, very large corpora. Because the corpora used in a Word2Vec application are typically so very large, we have a big challenge: our initial vector of terms is also very, very large, and we're trying to do dimensionality reduction. We're trying to bring this very, very large initial vector down to a reasonable size. To do this, the Doc2Vec algorithm thinks about the words that are near any given word that we're considering; it does a sort of neighborhood process. As a result of this approach of thinking of the neighborhood of words around any given word, we wind up with a many-to-many term mapping. That means that each term in the input vector maps to multiple terms in the output vector, and also every term in the output vector can be mapped to by multiple terms from the input. In this case, our cut-off deals with the total number of terms that we want to have in the resulting reduced-dimensionality vector, whereas previously, with TF-IDF, we were thinking about a numerical value for the cut-off, the smallest TF-IDF value that we would include.

So let's do a summary comparison. TF-IDF is a one-to-one mapping, whereas Doc2Vec is many-to-many. When we think about the probabilistic aspects, we see that TF-IDF is a simple probability, just the probability of a given term occurring, whereas Word2Vec is a covariant probability: we look at one variable occurring in the context of another. So let's see how these factors impact the practicalities. One of the most important considerations is your corpus size. If you're dealing with a mega-corpus, Google scale, then probably a Doc2Vec is needed. But if you're dealing with a smaller corpus, and this is true for lots of practical and industry-specific applications, then maybe just a simple TF-IDF will do. Highly related to corpus size, and really more appropriate to what we've been thinking
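To make that "neighborhood process" concrete, here is a small sketch of the sliding context window that Word2Vec-style training pairs come from. This is not the training algorithm itself, just a demonstration of why the mapping ends up many-to-many: every word collects several neighbors, and itself serves as a neighbor to several other words. The window size of 2 is an arbitrary choice for the example.

```python
from collections import defaultdict

def context_neighbors(tokens, window=2):
    """For each word, collect every word within `window` positions of it.
    These (word, neighbor) pairs are the raw material Word2Vec-style
    models train on; the relationship is many-to-many."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                neighbors[word].add(tokens[j])
    return dict(neighbors)

tokens = "term vectors characterize the important terms".split()
hood = context_neighbors(tokens, window=2)
# 'characterize' maps to several neighbors, and is itself a neighbor
# of several other words: a many-to-many mapping, unlike TF-IDF.
```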
about in terms of what kind of algorithm to use, is how specific or focused your corpus is. One way to look at this is to ask yourself whether there is a common or core vocabulary that is typically used throughout your corpus. That's not to say that you have to be restrictive; you're not forcing a controlled vocabulary. But if the same terms are used frequently enough, then probably a TF-IDF will work well for you. In this case, using TF-IDF makes a lot of sense. That's because you don't need to take any of your commonly used terms and express them as a vector combination of other terms; every term is pretty much useful and distinctive and relevant on its own.

The third and final consideration is how much you, in your role as knowledge manager, have a deep and personal understanding of the corpus content. Do you know enough so that you can personally identify good equivalence terms? That's not just synonyms, but terms that function well enough that they can substitute for each other. This will let you set up classes of equivalence terms, and that will make your TF-IDF much more effective. Also, you'll want to eliminate stop words, terms such as "would," or things like action verbs such as "accept" or "provide" that are too general to be useful but have crept in nevertheless.

Let's do a brief recap. With a very large corpus, one that spans many topics, for which you don't have a great deal of intimate personal knowledge, and which is difficult to model, probably a Doc2Vec will work. But if you have a smaller corpus that's very focused, then TF-IDF might be the preferred algorithm. Let's keep in mind that our current discussion is just on text vectorization; we haven't yet begun addressing the next stage, which is the things that you can do after you've got an appropriately vectorized representation for each of your documents.

Hello again! Thank you for joining me. I'm so glad that we could look together at these two comparable algorithms in natural language processing today.
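The equivalence-class and stop-word cleanup described above might look like the following sketch. The word lists here are hypothetical examples chosen for illustration, not a recommended vocabulary, though "would," "accept," and "provide" come straight from the discussion.

```python
# Hypothetical equivalence classes a knowledge manager might define:
# each alternative term maps to one canonical representative.
EQUIVALENTS = {
    "automobile": "car",
    "vehicle": "car",
    "physician": "doctor",
}

# Stop words and overly general action verbs to drop (example list).
STOP_WORDS = {"the", "would", "accept", "provide"}

def normalize(tokens):
    """Map equivalence terms to their canonical form, drop stop words."""
    out = []
    for t in tokens:
        t = EQUIVALENTS.get(t.lower(), t.lower())
        if t not in STOP_WORDS:
            out.append(t)
    return out

cleaned = normalize(["The", "physician", "would", "provide", "a", "vehicle"])
# cleaned == ['doctor', 'a', 'car']
```

Running this normalization before computing term frequencies means "physician" and "doctor" contribute to a single TF-IDF entry instead of splitting their weight, which is exactly why equivalence classes make TF-IDF more effective on a focused corpus.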
Now, this vid is one in a series of vids on the same topic, NLP algorithms. Following this vid, after we've had separate vids to discuss TF-IDF and also the Word2Vec and Doc2Vec algorithms individually, we're going to take on some clustering and topic mapping algorithms, for example k-means and the well-known LDA, or latent Dirichlet allocation. We're also going to take a look a little later in the series at the role of ontologies, a rarely discussed component of NLP, but one that becomes more and more important. Join me!