Transcript:
Hey, everyone, welcome to week four. Okay, so today I want to show you a little bit about a technique called TF-IDF. But more importantly, I want to talk to you today about how we're going to teach computers to read. Computers are really good number crunchers, but they're not so good at reading words, so today we're going to send our computer to school and teach it how to read by converting words into numbers. If you've heard of language processing or text analytics before, what we're going to do here lays the foundation for being able to process textual data. This is something that's pretty cool and very powerful, and it's a technique used very commonly in the data science space. In the last couple of weeks, we've been talking about how to use variables that are structured or well-known, like back in the Titanic example: is the person male or female? Today we're going to use variables that are derived from text, and I'm going to show you how to do that.

Okay, so first, just a quick little definition. Whenever we're talking about text, we usually talk about a corpus of documents. What the heck is a corpus? Well, it's just a collection of documents, just like a gaggle is a collection of geese. So I made two documents in my corpus: document A, "the cat sat on my face," and document B, "the dog sat on my bed."

All right, usually when we're doing a text analysis, we can use a model called bag of words to represent a document, and in the bag of words model you can imagine that each document is really just a bag of words. That's what I'm doing here: I'm using the split command to split the documents in my corpus into their individual words. You can see that bag of words B is really just doc B split into a list of words. When we do this, it's called tokenizing, and there are a few different ways to tokenize. This is a very simplistic one. If you notice, all of my words are lowercase, there's no punctuation, and there are no weird characters. That's pretty nice; in the real world it doesn't usually work out that way. But my split tokenizer works in this simple example, so that's good, and we have these documents tokenized.

So how do we get a tokenized bag of words into numbers? There are a few strategies. The simplest one possible, which I always like to start with, is to create a vector of all the possible words for each document and then just count how many times each word appears. So I'm going to do that here. First of all, I create a set of the unique words contained across all documents. I'm using sets here because that eliminates duplicates, so I'm creating a union of the two sets, bag of words A and bag of words B, and that gives me a set, which is effectively the list of all the unique words in the corpus. Then I create some dictionaries, one for each bag of words, and set all of their values to zero; here's word dict A, for example, all set to zero. Then I just go through and count all the words in my bag of words, incrementing the count by one each time a word is found. You can see that word dict A now has a count of all the words that occur and their occurrences: in word dict A we can see cat occurs once, face occurs once, my occurs once, and so on.
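Here is a minimal sketch of the counting steps being narrated, assuming the same two toy documents; the variable names (docA, bagOfWordsA, wordDictA, and so on) are my reconstruction of the notebook being described, not the original code:

```python
# Two toy documents in the corpus.
docA = "the cat sat on my face"
docB = "the dog sat on my bed"

# Tokenize with a simple whitespace split: each document becomes a bag of words.
bagOfWordsA = docA.split(" ")
bagOfWordsB = docB.split(" ")

# Unique words across the whole corpus; the set union removes duplicates.
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

# One count dictionary per document, all values initialized to zero.
wordDictA = dict.fromkeys(uniqueWords, 0)
wordDictB = dict.fromkeys(uniqueWords, 0)

# Count occurrences: increment the zero each time a word is found.
for word in bagOfWordsA:
    wordDictA[word] += 1
for word in bagOfWordsB:
    wordDictB[word] += 1
```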
All right, so then, lastly, what I can do is take those two dictionaries and stick them into a matrix, and here we have that matrix. I'm using pandas for convenience and readability here, but you could just as easily use a NumPy array. What this gives me is a numerical representation of both sentences, or both bags of words, for the computer. Now we can build models to analyze all of these unstructured bags of words, and just like that, we've converted a word problem into a linear algebra problem, and computers are really good at linear algebra, probably better than us. So that's great, mission accomplished. But wait: if you notice, "my," "on," "sat," and "the" are pretty common words, and they all occur in both of these documents in the corpus. The problem with the counting strategy is that we commonly use a lot of words that just don't mean much. In fact, the most commonly used word in the English language, "the," makes up seven percent of the words we speak, and that's double the frequency of the next most popular word, which happens to be "of," which in turn is double the frequency of the next most popular word, which I don't know offhand. The distribution of words in a language is a power law distribution, which is the basis for Zipf's law; here's the link on Wikipedia. By the way, you can get this document off of my GitHub.

Okay, so if we construct our document matrix out of counts, then we end up with numbers that don't contain a lot of information, unless our goal was just to see who uses "the" the most. That's where TF-IDF comes in. TF-IDF is just a better strategy for counting words and weighting the counts appropriately. Rather than just counting, we can use the TF-IDF score of a word to rank its importance. The TF-IDF score of a word w is the term frequency of the word multiplied by the inverse document frequency of the word, where term frequency is the number of times the word appears in a document normalized by the number of words in the document, and inverse document frequency is the log of the number of documents divided by the number of documents that contain the word. All right, that sounds a little bit weird, but hopefully this example will work it out for you.

I'm going to be implementing TF-IDF in code here. Luckily for you and me and everyone, we don't have to do this every time we use TF-IDF; it's well known and built into scikit-learn in Python and in other places, and you'll be able to use TF-IDF just by calling a library. But I want you to understand how it works, so I'm going to work through some of the code. The first thing I'm going to do is compute the term frequencies, and the way I'm going to do that is with this bit of code: I count the number of times a word appears in a document and then divide that by the total number of words in the document. That's what this does. Then I'll calculate the term frequency for bag of words A and the term frequency for bag of words B, just like that.
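As a rough sketch of the count matrix and the term-frequency step just described (again using the reconstructed variable names from the sketch above; computeTF is a hypothetical name standing in for the function being narrated):

```python
import pandas as pd

# Raw-count document matrix: one row per document, one column per unique word.
countMatrix = pd.DataFrame([wordDictA, wordDictB])

def computeTF(wordDict, bagOfWords):
    """Term frequency: how often a word appears in a document,
    normalized by the total number of words in that document."""
    totalWords = len(bagOfWords)
    return {word: count / totalWords for word, count in wordDict.items()}

tfA = computeTF(wordDictA, bagOfWordsA)
tfB = computeTF(wordDictB, bagOfWordsB)
```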
All right, so then, moving on: compute IDF does a similar thing, but in this case I only need to compute the IDF once for all of the bags of words together, because IDF refers to the number of occurrences of a word across all documents in the corpus. So again, I just go back to my original definition up here: it's the log of the number of documents divided by the number of documents that contain the word w. First of all, I figure out N, and that's just the length of the document list; easy enough. The next part is a little bit harder: to count the number of documents that contain a word w, I end up iterating over each document in the document list and then over each word, and for words that exist I increment them by one. That's a good way to count only one instance per document. Then, lastly, I take what I just calculated here, which I had stored in a dict, divide N by that previous value, apply the log to that, and return the dictionary, and that's my IDF value. And then, lastly, we're going to compute TF-IDF, which is really just the TF times the IDF. So I've done that here, and then I stick those into a data frame (see the sketch of these IDF and TF-IDF steps below).

Now this hopefully shows you some of the power of TF-IDF. If you remember before, I'll scroll back up, scrolling, scrolling, scrolling, the most common words were "on," "my," "sat," and "the," and they didn't really provide much differentiation. We had two sentences that are very different, right? One is about a cat sitting on a face; the other one is about a dog sitting on a bed. If you ask a computer to look at those counts, yeah, there are some differences, but they're pretty hard to see. Down here in our TF-IDF example, it's a lot easier to see: the words that occur commonly and equally across both documents in the corpus are zero. They're just not important, we don't care. But we can see that the first document has values for face and for cat, and the second document has values for bed and for dog, so it's very easy for even a human, and for a computer, to see that this one is all about beds and dogs, whereas this one is all about cats and faces, unfortunately. Okay, so that should give you a little bit of intuition about how TF-IDF works and why it would be helpful. Next, we're going to cover a use case where we use TF-IDF to do some analysis, and I'll see you soon.
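And here is a sketch of the IDF and TF-IDF steps described in the last part of the walkthrough, under the same assumptions as the earlier sketches; computeIDF and computeTFIDF are hypothetical names that follow the narration rather than the original notebook:

```python
import math
import pandas as pd

def computeIDF(documents):
    """Inverse document frequency: log(N / number of documents containing the word),
    computed once over all count dictionaries in the corpus."""
    N = len(documents)
    # Start every word's document count at zero ...
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    # ... then add one for each document in which the word appears at least once.
    for document in documents:
        for word, count in document.items():
            if count > 0:
                idfDict[word] += 1
    # Divide N by each document count and take the log.
    for word, docCount in idfDict.items():
        idfDict[word] = math.log(N / docCount)
    return idfDict

def computeTFIDF(tfDict, idfDict):
    """TF-IDF score: term frequency multiplied by inverse document frequency."""
    return {word: tf * idfDict[word] for word, tf in tfDict.items()}

idfs = computeIDF([wordDictA, wordDictB])
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)

# Final matrix: words shared by both documents ("the", "sat", "on", "my") score zero,
# while "cat"/"face" stand out in document A and "dog"/"bed" in document B.
tfidfMatrix = pd.DataFrame([tfidfA, tfidfB])
```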