In the last video, you learned about the beam search algorithm. You remember how, for machine translation, given an input French sentence, you don't want to output a random English translation; you want to output the best, the most likely, English translation. The same is also true for speech recognition, where, given an input audio clip, you don't want to output a random text transcript of that audio; you want to output the best, maybe the most likely, text transcript. Beam search is the most widely used algorithm to do this, and in this video you'll see how to get beam search to work for yourself. Let's describe beam search using our running example of the French sentence "Jane visite l'Afrique en septembre", hopefully being translated into "Jane visits Africa in September". The first thing beam search has to do is try to pick the first word of the English translation that it's going to output. So here I've listed, let's say, 10,000 words in the vocabulary, and to simplify the problem a bit, I'm going to ignore capitalization, so I'm just listing all the words in lowercase. In the first step of beam search, I'll use this network fragment, with the encoder shown in green and the decoder shown in purple, to try to evaluate the probability of that first word. So, what's the probability of the first output ŷ⟨1⟩ given the input sentence x, the French input? Whereas greedy search will pick only the one most likely word and move on, beam search instead can consider multiple alternatives. The beam search algorithm has a parameter called B, which is called the beam width, and for this example I'm going to set the beam width equal to 3. This means beam search will consider not just one possibility but three at a time. So in particular, let's say that, evaluating this probability over different choices of the first word, it finds that "in", "jane", and "september" are the most likely three possibilities for the first word in the English output. Then beam search will store away in computer memory that it wants to try out all three of these words. If the beam width parameter were set differently, say to 10, then it would keep track of not just three but of the ten most likely possible choices for the first word. So, to be clear, in order to perform this first step of beam search, what you need to do is run the input French sentence through this encoder network and then through this first step of the decoder network. This gives a softmax output over all 10,000 possibilities; you would then take those 10,000 possible outputs and keep in memory the top three. Let's go on to the second step of beam search. Having picked "in", "jane", and "september" as the three most likely choices of the first word, what beam search will do now is, for each of these three choices, consider what should be the second word. So after "in", maybe the second word is "a", or maybe it's "aaron"; I'm just listing words from the vocabulary, from the dictionary. Somewhere down the list will be "september", somewhere down the list is "visit", and all the way down, maybe the last word is "zulu". To evaluate the probability of the second word, it will use this neural network fragment, with the encoder in green, and for the decoder portion, when trying to decide what comes after "in", remember that the decoder first outputs ŷ⟨1⟩. So we're going to set ŷ⟨1⟩ to the word "in" and feed it back in, because we've decided, for now, that the first word was "in" and we're trying to figure out the second word. This will then output ŷ⟨2⟩. And so you hardwire ŷ⟨1⟩ here, really the input here, to be the word "in".
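This first step can be sketched in a few lines of code. Here the decoder's softmax distribution over the vocabulary is replaced by made-up numbers over a toy six-word vocabulary standing in for the real 10,000-word one:

```python
# Hypothetical first-step probabilities P(y1 | x) from the decoder's softmax,
# over a toy six-word vocabulary standing in for the real 10,000-word one.
p_y1_given_x = {
    "a": 0.05, "in": 0.30, "jane": 0.25,
    "september": 0.20, "visits": 0.15, "zulu": 0.05,
}

B = 3  # beam width

# Keep the B most likely first words (greedy search would keep only one).
top_b = sorted(p_y1_given_x, key=p_y1_given_x.get, reverse=True)[:B]
print(top_b)  # → ['in', 'jane', 'september']
```

With B = 1 this would keep only "in", which is exactly the greedy choice; the beam width controls how many alternatives survive.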
With ŷ⟨1⟩ hardwired as "in", this network fragment can be used to evaluate the probability of the second word given the input French sentence and given that the first word of the translation is "in". Now, notice that what we ultimately care about in this second step of beam search is to find the pair of the first and second words that is most likely; not just the second word that is most likely, but the pair of the first and second words that is most likely. By the rules of conditional probability, this can be expressed as P(y⟨1⟩ | x), the probability of the first word, times P(y⟨2⟩ | x, y⟨1⟩), the probability of the second word, which you're getting from this network fragment. So if, for each of the three words you've chosen, "in", "jane", and "september", you've saved away the probability P(y⟨1⟩ | x), then you can multiply it by this second probability to get the probability of the first and second words, P(y⟨1⟩, y⟨2⟩ | x). So now you've seen how, if the first word was "in", you can evaluate the probability of the second word. If the first word was "jane", you'd do the same thing. The sentence could be "jane a", "jane aaron", and so on down to "jane is", "jane visits", and so on. You would use this neural network fragment, let me draw this in as well, where you hardwire ŷ⟨1⟩ to be "jane". With the first word ŷ⟨1⟩ hardwired as "jane", this network fragment can tell you the probability of the second word given the input x and given that the first word was "jane". Then, same as above, you can multiply by P(y⟨1⟩ | x) to get the probability of y⟨1⟩ and y⟨2⟩ for each of these 10,000 different possible choices for the second word. Finally, you do the same thing for "september", considering all the words from "a" down to "zulu", and use this third network fragment.
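The pair probability described above is just the chain rule of conditional probability. A tiny numeric sketch, with made-up probabilities for illustration:

```python
import math

# Hypothetical values for illustration:
p_y1 = 0.30           # P(y1 = "in" | x), saved from the first step
p_y2_given_y1 = 0.40  # P(y2 = "september" | x, y1 = "in"), from the decoder

# Probability of the pair of first and second words:
# P(y1, y2 | x) = P(y1 | x) * P(y2 | x, y1)
p_pair = p_y1 * p_y2_given_y1

# Equivalent computation in log space, which is what implementations
# typically accumulate to avoid numerical underflow on long sentences:
log_p_pair = math.log(p_y1) + math.log(p_y2_given_y1)
print(round(math.exp(log_p_pair), 6))  # → 0.12
```

The log-space form matters in practice because a product of many probabilities, one per output word, quickly underflows to zero in floating point.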
I've drawn this in as well, to see, if the first word was "september", what would be the most likely options for the second word. So, for this second step of beam search, because we're continuing to use a beam width of 3, and because there are 10,000 words in the vocabulary, you'd end up considering 3 × 10,000, or 30,000, possibilities: there are 10,000 here, 10,000 here, and 10,000 here, so 30,000 is the beam width times the number of words in the vocabulary. What you do is evaluate all 30,000 of these options according to the probability of the first and second words, and then pick the top three. So we're going to cut these 30,000 possibilities down to three, down to the beam width parameter, again. Let's say that, of these 30,000 choices, the most likely were "in september", and say "jane is", and "jane visits"; sorry, this is a bit messy, but those are the most likely three out of the 30,000 choices. That's what beam search would memorize and take on to the next step. Notice one thing: if beam search decides that the three most likely choices for the first and second words are "in september", "jane is", and "jane visits", then it is now rejecting "september" as a candidate for the first word of the output English translation. So we're now down to two possibilities for the first word, but we still have a beam width of three, keeping track of three choices for pairs of y⟨1⟩, y⟨2⟩ before going on to the third step of beam search.
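This expand-and-prune step, B × 10,000 candidates cut back down to B, can be sketched as follows, again with a toy vocabulary and made-up conditional probabilities standing in for the decoder's outputs:

```python
B = 3
vocab = ["a", "in", "is", "jane", "september", "visits"]

# Hypothetical survivors of step 1 with their probabilities P(y1 | x).
beams = {("in",): 0.30, ("jane",): 0.25, ("september",): 0.20}

# Hypothetical conditionals P(y2 | x, y1), one distribution per beam
# (in the real model, each comes from one copy of the decoder network).
cond = {
    ("in",):        {"a": 0.05, "in": 0.0, "is": 0.05, "jane": 0.1, "september": 0.7, "visits": 0.1},
    ("jane",):      {"a": 0.05, "in": 0.0, "is": 0.45, "jane": 0.0, "september": 0.0, "visits": 0.5},
    ("september",): {"a": 0.2, "in": 0.2, "is": 0.2, "jane": 0.2, "september": 0.1, "visits": 0.1},
}

# Expand: every (beam, next word) pair, B * |vocab| = 18 candidates here
# (3 * 10,000 = 30,000 with the real vocabulary).
candidates = {
    prefix + (w,): p * cond[prefix][w]
    for prefix, p in beams.items()
    for w in vocab
}

# Prune: keep only the top B joint probabilities P(y1, y2 | x).
beams = dict(sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:B])
print(list(beams))  # → [('in', 'september'), ('jane', 'visits'), ('jane', 'is')]
```

Note that with these made-up numbers "september" drops out as a first word, just as in the example above: all three surviving pairs begin with "in" or "jane".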
I just want to note that, because the beam width is equal to three, at every step you instantiate three copies of the network to evaluate these partial sentence fragments in the output; it's because the beam width is equal to three that you have three copies of the network with different choices for the first word. But these three copies of the network can be used very efficiently to evaluate all 30,000 options for the second word. So you don't instantiate 30,000 copies of the network; you need only three copies of the network to very quickly evaluate all 10,000 possible outputs at that softmax output for ŷ⟨2⟩. Let's just quickly illustrate one more step of beam search. We said that the most likely choices for the first two words were "in september", "jane is", and "jane visits", and for each of these pairs of words, we should have saved away in computer memory the probability P(y⟨1⟩, y⟨2⟩ | x), given the input, the French sentence, x. Similar to before, we now want to consider what the third word should be. Is it "in september a"? "in september aaron"? All the way down to "in september zulu"? To evaluate possible choices for the third word, you use this network fragment, where you hardwire the first word here to be "in" and the second word to be "september", and so this network fragment lets you evaluate the probability of the third word given the input French sentence x and given that the first two words of the English output are "in september". Then you do the same thing for the second fragment, like so, for "jane is", and the same thing for "jane visits". Beam search will then once again pick the top three possibilities. Maybe it decides that
"in september jane is" is a likely outcome, or "in september jane visits" is likely, or maybe "jane visits africa" is likely for the first three words, and then it keeps going. You go on to the fourth step of beam search, adding one more word, and on it goes. The outcome of this process, adding one word at a time, hopefully will be that beam search decides that "Jane visits Africa in September", terminated by the end-of-sentence symbol (using that in your system is quite common), is a likely output English sentence. You'll see more details of this for yourself in this week's programming exercise as well, where you get to play with beam search yourself. So, with a beam width of three, beam search considers three possibilities at a time. Notice that if the beam width were set equal to one, so it considers only one possibility, then this essentially becomes the greedy search algorithm, which we discussed in the last video. But by considering multiple possibilities, say three or ten or some other number, at the same time, beam search will usually find a much better output sentence than greedy search. You've now seen how beam search works, but it turns out there are some additional tips and tricks and refinements that help you make beam search work even better. Let's go on to the next video to take a look.
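Putting the steps together, the whole procedure can be sketched as one short function. The decoder network is replaced here by a hypothetical lookup table of made-up conditional probabilities, and `<eos>` plays the role of the end-of-sentence symbol; the expand-and-prune loop, the log-probability accumulation, and the beam width B follow the algorithm described above:

```python
import math

def beam_search(step_probs, B=3, eos="<eos>", max_len=10):
    """step_probs(prefix) -> dict mapping each next word to P(word | x, prefix).
    Returns the most likely completed sentence found with beam width B.
    With B=1 this reduces to greedy search."""
    beams = [(("<s>",), 0.0)]  # (prefix, accumulated log probability)
    done = []                  # sentences that have reached <eos>
    for _ in range(max_len):
        # Expand every surviving prefix by every possible next word...
        candidates = [
            (prefix + (w,), logp + math.log(p))
            for prefix, logp in beams
            for w, p in step_probs(prefix).items()
            if p > 0
        ]
        # ...then prune back down to the B most likely partial sentences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:B]:
            (done if prefix[-1] == eos else beams).append((prefix, logp))
        if not beams:
            break
    return max(done + beams, key=lambda c: c[1])

# Toy stand-in for the decoder: hypothetical conditional probabilities.
# Any prefix not listed is assumed to end the sentence.
table = {
    ("<s>",): {"jane": 0.6, "in": 0.3, "september": 0.1},
    ("<s>", "jane"): {"visits": 0.6, "is": 0.4},
    ("<s>", "in"): {"september": 1.0},
    ("<s>", "jane", "visits"): {"africa": 1.0},
}
best, logp = beam_search(lambda prefix: table.get(prefix, {"<eos>": 1.0}))
print(" ".join(best[1:-1]))  # → jane visits africa
```

Note that "jane visits africa" wins even though it is longer: its joint probability (0.6 × 0.6 = 0.36) beats "in september" (0.3), which is exactly the pairwise comparison beam search carries forward at each step.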