Transcript:
Arg Max and SoftMax... StatQuest! Hello, I'm Josh Starmer and welcome to StatQuest. Today we're going to talk about Neural Networks Part 5: ArgMax and SoftMax.

Note: this StatQuest assumes that you already understand the main ideas behind neural networks, backpropagation, and how neural networks work with multiple inputs and outputs. If not, check out the earlier StatQuests; links are in the description below.

In the StatQuest on neural networks with multiple inputs and outputs, we had a fancy neural network that used petal and sepal widths to predict iris species. It made predictions for setosa, versicolor, and virginica with wrinkled surfaces. Then we plugged in the petal and sepal widths from an iris we found in the forest, ran the numbers through the neural network, and guessed that the iris was versicolor because that output value, 0.86, was the one closest to 1. BAM!

Now let's see what happens when the petal width equals 0 and the sepal width equals 1. When we run the numbers through the neural network, the raw output values are 1.43 for setosa, -0.4 for versicolor, and 0.23 for virginica. So one of the things we notice is that the raw output values are not always between 0 and 1.
Sometimes a raw output value can be greater than 1, like it is for setosa (1.43), and sometimes a raw output value can be less than 0, like it is for versicolor (-0.4). This wide range of values makes the raw output harder to interpret than it needs to be, and this is one of the reasons why, when there are multiple outputs like this, the raw output values are sent to an ArgMax layer or a SoftMax layer before the final decision is made.

ArgMax sounds like something a pirate might say... it simply sets the largest raw output value to 1 and all of the other values to 0. In this example, setosa has the largest raw output value, 1.43, so ArgMax sets the final output value for setosa to 1 and sets the final output values for versicolor and virginica to 0. So when we use ArgMax, the prediction from the neural network is simply the output that gets a 1, and that makes the output super easy to interpret. BAM!
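To make the ArgMax step concrete, here is a minimal sketch in Python (the function name arg_max and the use of plain lists are my own choices, not something from the video):

```python
def arg_max(raw_outputs):
    """Set the largest raw output value to 1 and every other value to 0."""
    largest = max(raw_outputs)
    return [1 if value == largest else 0 for value in raw_outputs]

# Raw output values for setosa, versicolor, and virginica from the example
print(arg_max([1.43, -0.4, 0.23]))  # [1, 0, 0] -> the network predicts setosa
```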
The only problem with ArgMax is that we cannot use it to optimize the weights and biases in the neural network, because the output values of ArgMax are the constants 0 and 1. To see why this is a problem, let's put the second largest raw output value, 0.23, on a graph. Since 0.23 is the second largest value, ArgMax outputs 1 for any raw output value greater than 0.23 and 0 for any raw output value less than 0.23. And since the slope of both of these lines is 0, their derivatives are also 0. That's a problem, because if we want to find optimal values for any weight or bias in the neural network, then we end up plugging 0 into the chain rule for the derivative of ArgMax, and that makes the whole derivative 0. And when we plug 0 into gradient descent, we can't take steps toward the optimal parameter values, which means we can't use ArgMax for backpropagation. Wah wah...
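Written out as an equation, the argument looks roughly like this (my sketch of the chain rule from the backpropagation StatQuest, not a formula shown in this video):

$$
\frac{\partial\,\text{Loss}}{\partial\,w}
= \frac{\partial\,\text{Loss}}{\partial\,\text{ArgMax output}}
\times \underbrace{\frac{\partial\,\text{ArgMax output}}{\partial\,\text{raw output}}}_{=\,0}
\times \frac{\partial\,\text{raw output}}{\partial\,w}
= 0
$$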
This leads us to the SoftMax function. Even though people often want ArgMax for the final output, they usually use SoftMax for training. I have to say, SoftMax sounds like a brand of toilet paper, but since we're already using a roll of toilet paper to represent the SoftPlus activation function, let's use a soft teddy bear to represent SoftMax.

Let's see the SoftMax function in action! First, let's work out the SoftMax output value for setosa: we raise e to the power of setosa's raw output value and divide that by the sum of e raised to each of the raw output values. So, in this case, we plug in 1.43 for setosa, -0.4 for versicolor, and 0.23 for virginica, and when we do the math we get 0.69 as the SoftMax output value for setosa. Let's write down 0.69 so we don't forget it.

Now let's calculate the SoftMax output value for versicolor. The only thing that changes is the numerator: now we raise e to versicolor's raw output value instead of setosa's. So let's plug in the raw output values, do the math, and we get 0.10 as the SoftMax output value for versicolor.

Finally, let's calculate the SoftMax output value for virginica. Just like before, the only thing that changes is the numerator, which now has e raised to virginica's raw output value. So let's plug in the raw output values, do the math, and we get 0.21 as the SoftMax output value for virginica. BAM!
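Here is a minimal sketch of that calculation in Python (the function name soft_max is my own; the tiny differences from the 0.69, 0.10, and 0.21 quoted above just come from the raw output values being rounded):

```python
import math

def soft_max(raw_outputs):
    """Raise e to each raw output value and divide by the sum of all of those terms."""
    exponentiated = [math.exp(value) for value in raw_outputs]
    total = sum(exponentiated)
    return [value / total for value in exponentiated]

# Raw output values for setosa, versicolor, and virginica
print(soft_max([1.43, -0.4, 0.23]))  # roughly [0.68, 0.11, 0.21]
```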
Now let's take a look at the three SoftMax output values. First, 1.43, the largest raw output value (for setosa), is paired with the largest SoftMax output value, 0.69. Likewise, 0.23, the second largest raw output value (for virginica), is paired with the second largest SoftMax output value, 0.21. Finally, the lowest raw output value, -0.4 (for versicolor), is paired with the lowest SoftMax output value, 0.10. So we see that the SoftMax function preserves the original order, or rank, of the raw output values. BAM!

Second, notice that all three SoftMax output values are between 0 and 1. This is something the SoftMax function guarantees: no matter what the raw output values are, the SoftMax output values will always be between 0 and 1. Double BAM!

Third, if we add up all of the SoftMax output values, notice that the sum is 1. This means that, as long as the output classes are mutually exclusive, the SoftMax output values can be interpreted as predicted "probabilities".

Note: I put the word probabilities in quotation marks because you shouldn't rely too much on their accuracy. The reason is that the predicted "probabilities" depend, in part, on the weights and biases in the neural network, and the weights and biases depend, in turn, on the randomly selected initial values. If we change the initial values, we can end up with different weights and biases that give a neural network that classifies the data at least as well, and different weights and biases give different raw output values, and different raw output values give different SoftMax output values. In other words, randomly selected initial values for the weights and biases can result in different predicted values.
So the predicted probabilities depend not only on the input values, but also on the random initial values for the weights and biases, which is why you shouldn't rely too much on their accuracy. Small BAM!

Now let's go back to the original neural network with its original predicted probabilities and look at the general form of the SoftMax equation. The subscript i refers to a specific raw output value. For example, when i equals 1, we're talking about the raw output value for setosa, and that means we plug the raw output value for setosa into the numerator. In the denominator, we just have the sum of e raised to each of the raw output values.

We started with ArgMax, which is easy to interpret, but remember that its derivative is either 0 or undefined, so it can't be used for backpropagation. In contrast, SoftMax has a derivative that we can use for backpropagation. For example, for the SoftMax function for setosa, the derivative of the predicted probability with respect to setosa's raw output value is (the predicted probability for setosa) × (1 − the predicted probability for setosa), where the predicted probability for setosa equals 0.69. And when we do the math, we get 0.21.
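For reference, the general form of the SoftMax equation and the setosa derivative described above look like this (my LaTeX rendering of the formulas, using p for the predicted probabilities):

$$
\text{SoftMax}(\text{raw}_i) = \frac{e^{\text{raw}_i}}{\sum_{j} e^{\text{raw}_j}},
\qquad
\frac{\partial\, p_{\text{setosa}}}{\partial\, \text{raw}_{\text{setosa}}}
= p_{\text{setosa}}\,(1 - p_{\text{setosa}})
= 0.69 \times 0.31 \approx 0.21
$$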
PS: Some of you may be wondering where this derivative comes from, and some of you may already know. For those who want to know, I've made a video that walks you through it step by step, so check out that StatQuest! BAM!

Now, because the raw output values for versicolor and virginica also play a role in setosa's SoftMax output, we also need the derivatives of the predicted probability for setosa with respect to the raw output values for versicolor and virginica. The derivative of the predicted probability for setosa with respect to the raw output value for versicolor is the negative of the predicted probability for setosa times the predicted probability for versicolor, where the predicted probability for setosa equals 0.69 and the predicted probability for versicolor equals 0.10. When we do the math, we get -0.07. Finally, the derivative with respect to the raw output value for virginica is the negative of the predicted probability for setosa times the predicted probability for virginica, and when we do the math we get -0.15.
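Here is a small sketch of those three derivatives in Python (the variable names are my own; the probabilities 0.69, 0.10, and 0.21 come straight from the example above):

```python
# Predicted probabilities for setosa, versicolor, and virginica from the SoftMax example
p_setosa, p_versicolor, p_virginica = 0.69, 0.10, 0.21

# Derivatives of setosa's predicted probability with respect to each raw output value
d_wrt_setosa = p_setosa * (1 - p_setosa)      # 0.69 * 0.31  ~  0.21
d_wrt_versicolor = -p_setosa * p_versicolor   # -0.69 * 0.10 ~ -0.07
d_wrt_virginica = -p_setosa * p_virginica     # -0.69 * 0.21 ~ -0.14 (the -0.15 above presumably uses unrounded probabilities)

print(d_wrt_setosa, d_wrt_versicolor, d_wrt_virginica)
```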
So, unlike ArgMax, whose derivative is either 0 or undefined, the derivative of the SoftMax function is not always 0, and we can use it for gradient descent. And that is why neural networks with multiple outputs often use SoftMax for training and then use ArgMax, which has super easy to understand output, to classify new observations. Triple BAM!

Before we go, I should mention one more thing: in an earlier StatQuest in this series, on the main ideas of backpropagation, we used the sum of the squared residuals (SSR) to determine how well the neural network fit the data. However, when we use the SoftMax function, since the predicted values are between 0 and 1, we use something called cross entropy to determine how well the neural network fits the data, and we'll talk about cross entropy in the next StatQuest in this series. BAM!

Now it's time for some shameless self-promotion. If you want to review statistics and machine learning offline, check out the StatQuest study guides at statquest.org; there's something for everyone. Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs or a t-shirt or a hoodie, or just donating; the links are in the description below.