(SIRAJ) Hello world, it’s Siraj, and we’re going to make an app that reads an article of text and creates a one-sentence summary out of it using the power of natural language processing. Language is, in many ways, the seat of intelligence. It’s the original communication protocol that we invented to describe all the incredibly complex processes happening in our neocortex.
Do you ever feel like you’re getting flooded with an increasing amount of articles and links and videos to choose from? As this data grows, the importance of semantic density does as well. How can you say the most important things in the shortest amount of time? Having a generated summary lets you decide whether you want to deep dive further or not. And the better it gets, the more we’ll be able to apply it to more complex language, like that in a scientific paper or even an entire book. The future of NLP is a very bright one.
Interestingly enough, one of the earliest use cases for machine summarization was by the Canadian government in the early 90s, for a weather system they invented called FoG. Instead of manually sifting through all the meteorological data they had access to, they let FoG read it and generate a weather forecast from it on a recurring basis. It had a set textual template and it would fill in the values for the current weather, given the data. Something like this. It was just an experiment, but they found that sometimes people actually preferred the computer-generated forecasts to the human ones, partly because the generated ones used more consistent terminology. A similar approach has been applied in fields with lots of data that needs human-readable summaries, like finance. And in medicine, summarizing a patient’s medical data has proven to be a great decision support tool for doctors.
Most summarization tools in the past were extractive. They selected an existing subset of words or numbers from some data to create a summary.
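To make the extractive idea concrete, here is a minimal sketch in Python. It assumes one naive scoring rule (average word frequency per sentence), which is an illustration chosen for this note and not the method used by FoG or any specific tool mentioned in the video. The function name `extractive_summary` and the sample text are hypothetical.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Word frequencies over the whole text, ignoring very short words
    # (a crude stand-in for a stop-word list; this is an assumption).
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


forecast_text = (
    "Rain is expected today. "
    "Rain and heavy rain showers will continue tomorrow. "
    "Bring a coat."
)
summary = extractive_summary(forecast_text)
print(summary)
```

Note that the summary is always a sentence copied from the input. Nothing new is written, which is exactly the limitation of extractive methods.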
But you and I do something a little more complex than that. When we summarize, our brain builds an internal semantic representation of what we’ve just read, and from that we can generate a summary. This is instead an abstractive method, and we can do this with deep learning. What can’t we do with it?
So let’s build a text summarizer that can generate a headline from a short article using Keras. We’re going to use this collection of news articles as our training data. We’ll convert it to pickle format, which essentially means converting it into a raw byte stream. Pickling is a way of converting a Python object into a byte stream so we can easily reconstruct that object in another Python script. Modularity for the win. We’re saving the data as a tuple with the heading, description, and keywords. The heading and description are the lists of headings and their respective articles, in order, and the keywords are akin to tags, but we won’t be using those in this example.
We’re going to first tokenize, or split up, the text into individual words, because that’s the level we’re going to deal with this data at. Our headline will be generated one word at a time. We want some way of representing these words numerically. Bengio coined the term for this, word embeddings, back in 2003, but they were first made popular by a team of researchers at Google when they released word2vec, inspired by Boyz II Men. Just kidding. Word2vec is a two-layer neural net trained on a big unlabeled text corpus. It’s a pre-trained model you can download. It takes a word as its input and produces a vector as its output, one vector per word.
Creating word vectors lets us analyze words mathematically. So these high-dimensional vectors represent words, and each dimension encodes a different property, like gender or title. The magnitude along each axis represents the relevance of that property to a word. So we could say king minus man plus woman equals queen.
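The vector arithmetic behind that analogy can be sketched with toy data. The three-dimensional vectors below are hypothetical values invented for illustration (real word2vec or GloVe vectors have tens to hundreds of dimensions and no hand-assigned meanings), and `analogy` is a helper name made up for this sketch.

```python
import math

# Hypothetical toy vectors; the dimensions loosely mean
# (royalty, maleness, femaleness). Real embeddings are learned, not hand-set.
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


result = analogy("king", "man", "woman")
print(result)
```

With these toy values, king minus man plus woman lands exactly on the queen vector, so the nearest remaining word is "queen".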
We can also find the similarity between words, which equates to distance. Word2vec offers a predictive approach to creating word vectors, but another approach is count-based, and a popular algorithm for that is GloVe, short for global vectors. It first constructs a large co-occurrence matrix of words by context. For each word, i.e. row, it will count how frequently it sees it in some context, which is the column. Since the number of contexts can be large, it factorizes the matrix to get a lower-dimensional matrix, which represents words by features. So each row holds the feature representation of one word. And they also trained it on a large text corpus. Both perform similarly well, but GloVe trains a little faster, so we’ll go with that.
We’ll download the pre-trained GloVe word vectors from this link and save them to disk. Then we’ll use them to initialize an embedding matrix with our tokenized vocabulary from our training data. We’ll initialize it with random numbers, then copy all the GloVe weights of words that show up in our training vocabulary. And for every word outside this embedding matrix, we’ll find the closest word inside the matrix by measuring the cosine distance of GloVe vectors. Now we’ve got this matrix of word embeddings that we could do so many things with.
So how are we going to use these word embeddings to create a summary headline for a novel article? We feed it... let’s back up for a second. [INAUDIBLE] first introduced a neural architecture called sequence to sequence in 2014. That later inspired the Google Brain team to use it for text summarization, successfully. It’s called sequence to sequence because we are taking an input sequence and outputting not a single value, but a sequence as well.
[SINGING] We gonna encode, then we decode. We gonna encode, then we decode. When I feed it a book, it gets vectorized. And when I decode that, I’m mesmerized.
So we use two recurrent networks, one for each sequence. The first is the encoder network.
It takes an input sequence and creates an encoded representation of it. The second is the decoder network. We feed it, as its input, that same encoded representation, and it will generate an output sequence by decoding it.
There are different ways we can approach this architecture. One approach would be to let our encoder network learn these embeddings from scratch by feeding it our training data. But we’re taking a less computationally expensive approach, because we already have learned embeddings from GloVe. When we build our encoder LSTM network, we’ll set those pre-trained embeddings as our first layer’s weights. The embedding layer is meant to turn input integers into fixed-size vectors anyway. We’ve just given it a huge head start by doing this. And when we train this model, it will just fine-tune, or improve the accuracy of, our embeddings as a supervised classification problem, where the input data is our set of vocab words and the labels are their associated headline words. We’ll minimize the cross-entropy loss using RMSprop.
Now for our decoder. Our decoder will generate headlines. It will have the same LSTM architecture as our encoder, and we’ll initialize its weights using our same pre-trained GloVe embeddings. It will take as input the vector representation generated after feeding in the last word of the input text. So it will first generate its own representation using its embedding layer, and the next step is to convert this representation into a word. But there is actually one more step. We need a way to decide what part of the input we need to remember, like names and numbers. We talked about the importance of memory. That’s why we use LSTM cells. But another important aspect of learning theory is attention: basically, what is the most relevant data to memorize? Our decoder will generate a word as its output, and that same word will be fed in as input when generating the next word, until we have a headline.
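That feed-the-output-back-in loop can be sketched without any neural network at all. In the Python below, `greedy_decode` is a hypothetical helper, and `toy_next_word` is a fixed lookup table standing in for a trained decoder step; the start and end tokens `<s>` and `</s>` are also assumptions of this sketch, not names from the video’s code.

```python
def greedy_decode(next_word, start_token="<s>", end_token="</s>", max_len=10):
    """Generate words one at a time, feeding each output back in as the next input."""
    words = []
    current = start_token
    for _ in range(max_len):
        # In a real model this call would run one decoder step and pick
        # the most probable word from the softmax output.
        current = next_word(current, words)
        if current == end_token:
            break
        words.append(current)
    return words


# Hypothetical stand-in for a trained decoder: a fixed transition table.
toy_transitions = {
    "<s>": "markets",
    "markets": "rally",
    "rally": "again",
    "again": "</s>",
}


def toy_next_word(previous, history):
    """Look up the next word; fall back to the end token for unknown input."""
    return toy_transitions.get(previous, "</s>")


headline = " ".join(greedy_decode(toy_next_word))
print(headline)
```

The `max_len` cap matters in practice: a model that never emits the end token would otherwise loop forever.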
We use an attention mechanism when outputting each word in the decoder. For each output word, it computes a weight over each of the input words that determines how much attention should be paid to that input word. All the weights sum up to 1 and are used to compute a weighted average of the last hidden layers generated after processing each of the input words. We’ll take that weighted average and input it into the softmax layer, along with the last hidden layer from the current step of the decoder.
So let’s see what our model generates for this article after training. All right, we’ve got this headline, generated beautifully. And let’s do it once more for a different article. Couldn’t have said it better myself.
So, to break it down: we can use pre-trained word vectors from a model like GloVe to avoid having to create them ourselves. To generate an output sequence of words given an input sequence of words, we use a neural encoder-decoder architecture. And adding an attention mechanism to our decoder helps it decide what is the most relevant token to focus on when generating new text.
The winner of the coding challenge from the last video is Jie Xun See. He wrote an AI composer in 100 lines of code. Last week’s challenge was non-trivial, and he managed to get a working demo up, so definitely check out his repo. Wizard of the week.
The coding challenge for this video is to use a sequence to sequence model with Keras to summarize a piece of text. Post your GitHub link in the comments and I’ll announce the winner next video. Please subscribe for more programming videos, and for now, I’ve got to remember to pay attention. So thanks for watching.
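As a footnote to the attention step described earlier in this video, here is a minimal sketch of computing weights that sum to 1 and using them for a weighted average of the encoder’s hidden states. Scoring each input position with a dot product against the decoder state is an assumption made for this sketch; the video does not say how the weights are scored. The names `attend`, `decoder_state`, and `encoder_states` are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    weights: one attention weight per input word.
    context: the weighted average of the encoder hidden states.
    """
    # Assumed scoring rule: dot product between decoder and encoder states.
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * hs[i] for w, hs in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy hidden states for a three-word input, two dimensions each.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

In the full model, `context` would then be combined with the decoder’s current hidden state and passed to the softmax layer that picks the next word.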