Transcript:
Hi, and welcome to an illustrated guide to LSTMs and GRUs. I'm Learned Vector, a machine learning engineer in the AI voice assistant space. In this video, we'll start with the intuition behind LSTMs and GRUs. Then I'll explain the internal mechanisms that allow LSTMs and GRUs to perform so well, and I'll follow up with an example. If you want to understand what's happening under the hood of these two networks, then this video is for you.

LSTMs and GRUs are the more evolved versions of the vanilla recurrent neural network. During backpropagation, recurrent neural networks suffer from the vanishing gradient problem. The gradient is the value used to update a neural network's weights. The vanishing gradient problem is when the gradient shrinks as it back propagates through time. If a gradient value becomes extremely small, it doesn't contribute much to learning. So in recurrent neural networks, layers that get a small gradient update don't learn, and those are usually the earlier layers. Because these layers don't learn, RNNs can forget what they've seen in longer sequences, thus having short-term memory. If you want to know more about the mechanics of recurrent neural networks in general, you can watch my previous video, an Illustrated Guide to Recurrent Neural Networks.

LSTMs and GRUs were created as the solution to short-term memory. They have internal mechanisms called gates that can regulate the flow of information. These gates can learn which data in a sequence is important to keep or throw away. By doing that, they learn to use relevant information to make predictions. Almost all state-of-the-art results based on recurrent neural networks are achieved with these two networks. You can find LSTMs and GRUs in speech recognition models, speech synthesis, and text generation. You can even use them to generate captions for videos.

Okay, so by the end of this video, you should have a solid understanding of why LSTMs and GRUs are good at processing long sequences. I'm going to approach this with an intuitive explanation and illustrations, and avoid as much math as possible.

Let's start with a thought experiment. Let's say you're reading reviews online to determine if you want to buy a product. When you read a review, your brain subconsciously only remembers important keywords. If your goal is to judge whether a certain review is good or bad, you pick up words like "amazing" and "will definitely buy again." You don't care much for words like "this," "gave," "all," "should," etc. If a friend asked you the next day what the review said, you wouldn't remember it word for word. You might remember the main points, though, like "it was a perfectly balanced breakfast." The other words would just fade away from memory, unless, of course, you're one of those people with perfect memory. That is essentially what an LSTM and a GRU do: they learn to keep only relevant information to make predictions. In this case, the words you remembered made you judge that it was a good review.

To understand how LSTMs and GRUs achieve this, let's review the recurrent neural network. An RNN works like this: first, the words get transformed into machine-readable vectors. Then the RNN processes the sequence of vectors one by one. While processing, it passes the previous hidden state to the next step of the sequence. The hidden state acts as the neural network's memory: it holds information on previous data that the network has seen before.
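In code, that sequential flow is just a loop that carries the hidden state forward. Here is a minimal numpy sketch of the idea; the parameter names (`W_xh`, `W_hh`, `b_h`) are illustrative placeholders, not something from the video, and the tanh update inside the cell is explained next.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence, passing the hidden state step to step.

    W_xh, W_hh, b_h are illustrative parameters; in practice they are learned
    during training.
    """
    h = np.zeros(W_hh.shape[0])                 # initial hidden state: the network's "memory"
    for x in inputs:                            # process the sequence of vectors one by one
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # combine the current input with the previous hidden state
    return h                                    # final hidden state summarizes what the network has seen
```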
Let's zoom into the cell of an RNN to see how the hidden state is calculated. First, the input and the previous hidden state are combined to form a vector. That vector has information on the current input and the previous inputs. The vector goes through the tanh activation, and the output is the new hidden state, or the memory of the network.

The tanh activation is used to help regulate the values flowing through the network. The tanh function squishes values to always be between -1 and 1. When vectors flow through a neural network, they undergo many transformations due to various math operations. So imagine a value that continues to be multiplied by, let's say, 3. You can see how some values can explode and become astronomical, causing other values to seem insignificant. Let's see what a tanh does. A tanh function ensures that the values stay between -1 and 1, thus regulating the neural network's output.

So that's an RNN. It has very few operations internally, but works pretty well. An RNN uses a lot fewer computational resources than its evolved variants, the LSTM and GRU.

Let's take a look at the LSTM. An LSTM has the same control flow as a recurrent neural network: it processes data sequentially, passing on information as it propagates forward. The difference is in the operations within the LSTM's cells. These operations are what allow the LSTM to forget or keep information. Now, looking at these operations can get a little overwhelming, so we'll go over this one by one. I want to thank Chris Olah, sorry if I butchered that name. He has an excellent blog post on LSTMs, and the following information is inspired by his exceptionally well-written blog post. I'll include the link in the description.

The core concepts of LSTMs are the cell state and its various gates. The cell state acts as a transport highway that transfers relevant information all the way down the sequence chain. You can think of it as the memory of the network. Because the cell state can carry information throughout the sequence processing, in theory even information from earlier time steps could be carried all the way to the last time step, thus reducing the effects of short-term memory. As the cell state goes on its journey, information gets added to or removed from the cell state via gates. The gates are just different neural networks that decide which information is allowed on the cell state. The gates learn what information is relevant to keep or forget during training.

Gates contain sigmoid activations. A sigmoid activation is similar to the tanh, but instead of squishing values between -1 and 1, it squishes values between 0 and 1. That is helpful for updating or forgetting data, because any number multiplied by 0 is 0, causing values to disappear or be forgotten, and any number multiplied by 1 is the same value, so that value stays the same or is kept. This way the network can learn what data should be forgotten and what data is important to keep.

Let's dig a little deeper into what the various gates are doing. We have three different gates that regulate information flow in an LSTM cell: a forget gate, an input gate, and an output gate.

First, we have the forget gate. This gate decides what information should be thrown away or kept. Information from the previous hidden state and information from the current input is passed through the sigmoid function. Values come out between 0 and 1: closer to 0 means forget, and closer to 1 means keep.
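As a rough sketch, here is what a single gate looks like in numpy. The weights `W_f` and `b_f` are placeholder names I'm introducing for the forget gate; the input and output gates described below follow the same pattern, each with their own learned weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forget_gate(prev_hidden, current_input, W_f, b_f):
    """Forget gate: decide, element by element, how much of the cell state to keep."""
    combined = np.concatenate([prev_hidden, current_input])  # previous hidden state + current input
    return sigmoid(W_f @ combined + b_f)                     # values in (0, 1): near 0 -> forget, near 1 -> keep
```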
To update the cell state, we have the input gate. First, we pass the previous hidden state and the current input into a sigmoid function, which decides which values will be updated by transforming them to be between 0 and 1: 0 means not important, 1 means important. You also pass the hidden state and the current input into the tanh function to squish values between -1 and 1; this helps regulate the network. Then you multiply the tanh output with the sigmoid output. The sigmoid output decides which information is important to keep from the tanh output.

Now we should have enough information to calculate the cell state. First, the cell state gets pointwise multiplied by the forget vector. This has a possibility of dropping values in the cell state if it gets multiplied by values near zero. Then we take the output from the input gate and do a pointwise addition, which updates the cell state with new values. This gives us our new cell state.

Last, we have the output gate. The output gate decides what the next hidden state should be. Remember that the hidden state contains information on previous inputs; the hidden state is also used for predictions. First, we pass the previous hidden state and the current input into a sigmoid function. Then we pass the newly modified cell state to the tanh function. We multiply the tanh output with the sigmoid output to decide what information the hidden state should carry. The output is the hidden state. The new cell state and the new hidden state are then carried over to the next time step.

For those of you who understand better through seeing code, here is an example using Python pseudocode to showcase the control flow. First, the previous hidden state and the current input get concatenated; we'll call it "combined." Combined gets fed into the forget layer, which removes non-relevant data. Then the candidate layer is created using combined; its output holds possible values to add to the cell state. Combined also gets fed into the input layer, which decides what data from the candidate layer should be added to the new cell state. After computing the forget layer, the candidate layer, and the input layer, the cell state is computed using those vectors and the previous cell state. The output is then computed, and pointwise multiplying the output with the tanh of the new cell state gives you the new hidden state. That's it: the control flow of an LSTM is simply a for loop. The hidden state output from the LSTM cell can be used for predictions. Using all those mechanisms, an LSTM is able to choose which information is relevant to remember or forget during processing.
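Since the code itself isn't captured in the transcript, here is a minimal Python sketch that follows the control flow just described. The parameter names (`W_f`, `b_f`, and so on) are placeholders of my own; a real implementation would learn them during training. Each gate reuses the same "sigmoid over the combined vector" pattern from the forget-gate sketch above, just with its own weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(prev_ct, prev_ht, x, p):
    """One LSTM time step, mirroring the narrated control flow.

    `p` is a dict of illustrative weights/biases: W_f, b_f (forget), W_i, b_i (input),
    W_c, b_c (candidate), W_o, b_o (output).
    """
    combined = np.concatenate([prev_ht, x])              # previous hidden state + current input

    ft = sigmoid(p["W_f"] @ combined + p["b_f"])         # forget layer: what to drop from the cell state
    candidate = np.tanh(p["W_c"] @ combined + p["b_c"])  # candidate layer: possible values to add
    it = sigmoid(p["W_i"] @ combined + p["b_i"])         # input layer: which candidate values to keep
    ot = sigmoid(p["W_o"] @ combined + p["b_o"])         # output layer: what the hidden state should carry

    ct = prev_ct * ft + candidate * it                   # new cell state: forget old values, add new ones
    ht = ot * np.tanh(ct)                                # new hidden state, also used for predictions
    return ct, ht

# The control flow over a whole sequence is then simply a for loop:
#   for x in sequence:
#       ct, ht = lstm_cell(ct, ht, x, params)
```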
So now that we know how an LSTM works, let's look at the GRU. The GRU is a newer generation of recurrent neural network and is pretty similar to an LSTM. GRUs got rid of the cell state and use the hidden state to transfer information. The GRU also has only two gates, a reset gate and an update gate. The update gate acts similar to the forget and input gates of an LSTM: it decides what information to throw away and what new information to add. The reset gate is a gate used to decide how much past information to forget. So that's the GRU. GRUs have fewer tensor operations, so they are a little speedier to train than LSTMs. Researchers and engineers usually try both to determine which one works better for their use case.

To sum this up: RNNs are good for processing sequence data for predictions, but suffer from short-term memory. LSTMs and GRUs were created as a method to mitigate short-term memory using mechanisms called gates. Gates are just neural networks that regulate the flow of information being passed from one time step to the next. LSTMs and GRUs are used in state-of-the-art deep learning applications like speech recognition, speech synthesis, natural language understanding, etc. If you're interested in going deeper, I've added links in the description to some amazing resources that can give you a different perspective on understanding LSTMs and GRUs. I had a lot of fun making this video, so let me know in the comments if this was helpful or what you would like to see in the next one. If you liked this video, please subscribe for more AI content. Thanks for watching.