
Gensim Python Tutorial | Word2vec With Gensim – Python

The Semicolon


Word2vec With Gensim - Python

Transcript:

Hey everyone, welcome to The Semicolon. In this video we're going to train a word2vec model on the Reddit world news dataset using gensim. We'll see how to switch between skip-gram and continuous bag of words (CBOW) when training the model, then try a pre-trained word2vec model and spend some time playing with it, trying out the famous results. So let's start.

The first thing you need to do is install gensim: type pip install gensim, and if you're on Windows, type it in your Anaconda command prompt. While this installs, let's open the Anaconda Navigator to use a Jupyter notebook. Gensim is installed now, so let's start Jupyter and create a Python 3 notebook. We need to import Word2Vec for training the model and KeyedVectors for using a pre-trained one; let's import pandas and nltk as well.

The redditWorldNews.csv file has the news headlines posted on the world news subreddit. We're going to use those headlines to train our word2vec model, as I couldn't find a smaller, better dataset for this purpose; feel free to use any other dataset that fits your needs. We're only interested in the headlines, so let's just take the title column. We need NLTK to tokenize the words, and if you're doing this for the first time you might have to download some additional data with nltk.download(). Now let's tokenize, so we have a separate array of words for each sentence; this isn't strictly necessary, since gensim will do the conversion anyway, so you may as well pass just the text. When the corpus is large this takes quite a bit of time, so while it runs, let's write the code for the actual training.

We use the Word2Vec method, passing the text, the minimum count for a word to be considered for conversion into a vector, and the size of each vector. This doesn't execute because the parameter is min_count, not count; fixing that starts the training, and... it's now trained. model.most_similar() finds the ten closest words in the vector space we created, and the most similar words for "man" were "woman", "girl", "boy" and so on. Remember, this is a headlines corpus, so we don't expect very good accuracy, but it did decently. Let's try the famous relationship king - man + woman: it should give "queen", but it gives "king", which is weird; I think it's down to the quality of the dataset. Anyway, let's move on. To look at the vector of the word "man" we can use model["man"], which returns the vector; this is how "man" is represented in our vector space. These methods seem to be deprecated, though, so let's use the methods from the current version (accessed through model.wv), and now the warnings are gone.

So, was this model skip-gram or continuous bag of words? Let's look at the documentation to find out. This is the word2vec documentation; you'll find the API, usage notes and examples here, so make sure you visit it.
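For reference, here's a minimal sketch of the data-preparation step described above. The redditWorldNews.csv file name and the title column come from the video; the punkt download is my guess at the "additional data" NLTK asks for on first use.

```python
import pandas as pd
import nltk

# nltk.download('punkt')  # one-time download of the tokenizer data, if needed

# Load the headlines and keep only the title column
df = pd.read_csv('redditWorldNews.csv')
titles = df['title'].astype(str)

# One list of tokens per headline -- the shape gensim's Word2Vec expects
sentences = [nltk.word_tokenize(title) for title in titles]
```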
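And a sketch of the training and querying calls. The min_count value is an assumption (the video only says a minimum count was passed), size 32 matches the vector size mentioned later, and in gensim 4.0 and later the keyword is vector_size rather than size.

```python
from gensim.models import Word2Vec

# min_count drops words seen fewer than that many times;
# size is the dimensionality of each word vector
# (gensim >= 4.0 renames size= to vector_size=).
model = Word2Vec(sentences, min_count=2, size=32)

# Ten closest words in the learned vector space
print(model.wv.most_similar('man'))

# king - man + woman: add the positive vectors, subtract the negative
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man']))

# How 'man' is represented in our vector space (non-deprecated access)
print(model.wv['man'])
```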
If you go down and look at the training parameters, you can see that alongside min_count and size, which we've been using, we can also pass sg, short for skip-gram: when sg is set to 1 you get skip-gram, otherwise it's continuous bag of words. So our model was CBOW, and now you know what to do when you want to switch to skip-gram. There are plenty of other parameters to tweak, like the window size, cbow_mean, and the maximum vocabulary size to help with limited RAM; if you don't know what these mean, watch the previous video, which explains word2vec. And that's how we train a word2vec model on any corpus, with different parameters and different algorithms.

Now let's see how to use a pre-trained model. Let me restart the kernel so it's a little easier on my RAM. Go to the word2vec project by Google; here you have the C code for word2vec on which gensim is based, and having a C compiler installed speeds up training by around 70x, as claimed on the official gensim website. In this Google word2vec project you'll also find a pre-trained model. It's 1.5 GB, and I've already downloaded and extracted it into the same folder, so we just need to import it. The method for importing is load_word2vec_format(); we pass the file name as a parameter and set the binary flag to true. I'm also limiting the import to the 100,000 most frequent words, which makes it faster. And... the file name is wrong; let me edit it, and... oh, that was quick. Cool, so now we've imported the pre-trained model.

Let's play with it. The most similar words to "man" are almost the same result as before. Let's look at the vector of "man": this one is huge; the model we trained had size 32, and here it's 300. Why is it giving a deprecation warning, though? Isn't that weird? We used wv to avoid the warning, but it turns out wv is deprecated on KeyedVectors while the plain access is deprecated on Word2Vec. A little weird, but no problem, that's gone.

Let's try some famous word2vec results. King - man + woman: the first result is "king", but the second is "queen", which is right; maybe the accuracy is affected by using only 100,000 words. Let's try something else. Germany - Berlin + France gives... France? Is this model wrong? Oh no, it should be Germany - Berlin + Paris, and that is France; we were subtracting a capital and adding a capital, so that's right. Let's try another one: messi - football + cricket... "messi" is not in the vocabulary, so let's try it with an uppercase M, and we get a list of famous cricketers. Wow, I never thought this would work. Let's try some other sport: tennis, and we get Nadal. This is great.

So that's how we train a word2vec model in gensim, and how we use a pre-trained model; now you can use these vectors as inputs to various neural networks. In the next tutorials we'll pick a simple use case for different forms of data with deep learning, like video, music, text and anything else we find, so hit the bell icon and subscribe if you want a notification when that goes live. Thank you!
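A sketch of the skip-gram switch described above: the sg flag comes straight from the docs discussion, while the window and max_vocab_size values here are illustrative defaults, not values from the video.

```python
from gensim.models import Word2Vec

# sg=1 selects skip-gram; sg=0 (the default) is CBOW.
# window and max_vocab_size are among the tweakable knobs mentioned
# above; cbow_mean only matters when sg=0.
skipgram_model = Word2Vec(
    sentences,            # the tokenized headlines from earlier
    min_count=2,
    size=32,              # vector_size= in gensim >= 4.0
    sg=1,                 # 1 = skip-gram, 0 = CBOW
    window=5,             # context window on each side of the target word
    max_vocab_size=None,  # set an integer cap here to limit RAM usage
)
```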
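Finally, a sketch of loading and querying the pre-trained model. The file name below is the standard name of the extracted Google News archive; adjust it to match wherever you extracted your download.

```python
from gensim.models import KeyedVectors

# Load Google's pre-trained vectors (300 dimensions, binary format).
# limit=100000 keeps only the 100,000 most frequent words, making the
# import much faster and lighter on RAM.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=100000)

print(vectors.most_similar('man'))
print(vectors['man'].shape)  # (300,) -- much bigger than our size-32 model

# The famous analogies tried in the video
print(vectors.most_similar(positive=['king', 'woman'], negative=['man']))
print(vectors.most_similar(positive=['Germany', 'Paris'], negative=['Berlin']))
print(vectors.most_similar(positive=['Messi', 'cricket'], negative=['football']))
```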
