Transcript:

Hello, people from the future. Welcome to Normalized Nerd. In this video I'm going to explain GANs — yes, the famous generative adversarial networks. I know this is one of those topics that can feel really intimidating if you don't approach it properly, but trust me, by the end of this video you will feel very comfortable with GANs. I put a lot of effort into making these videos, so if you like my content, please subscribe and hit the bell icon. Let's get started. The first thing you need to know is that a GAN is not a single model; it's a combination of two models. The first is a generative model called G, and the second is a discriminative model called D. Now, what are discriminative and generative models? Well, in machine learning we have two main approaches for building predictive models. The most common is the discriminative approach: the model learns the conditional probability of the target variable given the input variable. Typical examples are logistic regression, linear regression, and so on. In a generative model, on the other hand, the model learns the joint probability distribution of the input variable and the output variable. If the model wants to predict something, it uses Bayes' theorem to compute the conditional probability of the target variable given the input variable. The most common example is the naive Bayes model. The biggest advantage of generative models over discriminative models is that we can use a generative model to create new instances of data, because we are learning the distribution function of the data itself — something that is simply not possible with a discriminative model. In a GAN, we use the generative model to produce new data points — that is, we produce fake data points with our generator — and we use the discriminator to tell whether a given data point is an original one or has been produced by our generator.
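The generative route described above — learn the joint distribution p(x, y), then invert it with Bayes' theorem to predict — can be sketched with a tiny discrete example. All the probability numbers here are invented toy values, not from the video:

```python
# Toy illustration of the generative approach: learn p(x, y), then use
# Bayes' theorem to get p(y | x). All probabilities are made-up values.

# Joint distribution p(x, y) over a binary feature x and binary label y.
p_joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def p_y_given_x(y, x):
    """Bayes' theorem: p(y | x) = p(x, y) / p(x)."""
    p_x = sum(p for (xi, _), p in p_joint.items() if xi == x)
    return p_joint[(x, y)] / p_x

# A discriminative model would learn p(y | x) directly; a generative model
# knows the full joint, so it can also *sample* brand-new (x, y) pairs.
print(p_y_given_x(1, 1))  # 0.40 / 0.60 ≈ 0.667
```

This sampling ability is exactly what the generator in a GAN exploits.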
These two models work in an adversarial setup — that means they compete with each other, and eventually both of them get better and better at their jobs. Let me show you the structure of this thing. Okay, so here's the high-level view of our GAN. G and D are nothing but multi-layered neural networks, and theta_G and theta_D are just their weights. We are using neural networks here because they can approximate any function — we know that from the universal approximation theorem. Now, look here: suppose this is the distribution function of the original data. In reality we can't really draw it, or even compute it mathematically, because we input images, voices, videos — high-dimensional, complex data — so this is only for mathematical analysis. Okay, and look here: this is a noise distribution, and you can see it's just a normal distribution. I am going to sample some data randomly from this distribution and feed it to our generator. To get something out of the generator, we must input something, right? And here we are inputting noise — that means this z contains no information. After passing z through our generator, it produces something called G(z). Now, notice that I have described the distribution of G(z) over the same x that I used for the original data. I am doing this because the domain of our original data is the same as the range of G(z). This is important because we are trying to replicate the original data. So just remember the shorthand: when I say p_data, it represents the probability distribution of our original data.
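The pipeline so far — sample noise z from a normal distribution and push it through a generator network — can be sketched as follows. The layer sizes, tanh activation, and random weights are my own illustrative choices, not the video's architecture:

```python
import math
import random

random.seed(0)

def randn(n):
    """Sample an n-dimensional vector from a standard normal (our p_z)."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

# Generator G: one hidden layer mapping noise z (dim 4) to data x (dim 2).
W1 = [randn(8) for _ in range(4)]   # 4x8 input-to-hidden weights
W2 = [randn(2) for _ in range(8)]   # 8x2 hidden-to-output weights

def G(z):
    """Forward pass of the generator: noise z -> fake data point G(z)."""
    h = [math.tanh(sum(z[i] * W1[i][j] for i in range(4))) for j in range(8)]
    return [sum(h[j] * W2[j][k] for j in range(8)) for k in range(2)]

z = randn(4)       # z ~ p_z: pure noise, carries no information
fake = G(z)        # G(z) lives in the same space as the real data
print(len(fake))   # 2 — same dimensionality as a real data point
```

The key point mirrored here is the last line: G(z) has the same shape as a real sample, so both can be fed to the same discriminator.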
When I say p_z, it represents the distribution of the noise, and when I say p_g, it represents the distribution of the output of our generator. We are going to pass both generated data and original data to our discriminator, and it will give us a single number: the probability that the input belongs to the original data. So you can see the discriminator is just a simple binary classifier. For training purposes, when we pass original data to the discriminator we set the label y = 1, and when we pass generated data we set the label y = 0. D will try to maximize its chances of predicting the correct classes, but G will try to fool D, so we can say that G and D are playing a two-player minimax game. What is that? A minimax game is just a two-player game, like tic-tac-toe, where one player tries to maximize its probability of winning while the other player tries to minimize the first player's probability of winning. Okay, we keep saying maximize and minimize — but maximize or minimize what? We need a mathematical expression, right? That is called the value function. Let me show you the value function for this minimax game:

min_G max_D V(D, G) = E_{x ~ p_data}[ln D(x)] + E_{z ~ p_z}[ln(1 − D(G(z)))]

Here min and max simply mean that G wants to minimize this expression while D wants to maximize it. I know that at first this might look like gibberish, but if you look closely, you will find that this expression is surprisingly similar to the binary cross-entropy function. And if you are feeling that, you are absolutely correct. Let me show you why. This is the ordinary binary cross-entropy function; for a moment, just ignore the negative sign and the summation, so this is the binary cross-entropy for one input. y is the ground truth —
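In practice the two expectations in the value function are estimated by averaging over a minibatch of samples. The toy discriminator outputs below are invented just to make the arithmetic concrete:

```python
import math

def value_fn(d_real, d_fake):
    """Monte-Carlo estimate of V(D, G):
    mean of ln D(x) over real samples + mean of ln(1 - D(G(z))) over fakes."""
    term_real = sum(math.log(d) for d in d_real) / len(d_real)
    term_fake = sum(math.log(1 - d) for d in d_fake) / len(d_fake)
    return term_real + term_fake

# Invented discriminator outputs: D is fairly confident on both batches.
d_real = [0.9, 0.8, 0.95]   # D(x) for real data (labels y = 1)
d_fake = [0.1, 0.2, 0.05]   # D(G(z)) for fake data (labels y = 0)
print(value_fn(d_real, d_fake))        # close to 0: D is winning

# A perfectly confused discriminator outputs 0.5 everywhere:
print(value_fn([0.5] * 3, [0.5] * 3))  # 2 * ln(0.5) ≈ -1.386
```

Note how a confident D pushes V toward its maximum (0 from below), while a confused D gives 2 ln(1/2) = −2 ln 2 — a value that reappears later in the convergence proof.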
that is, the label — and ŷ is the model's prediction. When y = 1 — that is, when we are passing original data — ŷ equals D(x). So if you substitute these into the formula, you get the loss ln D(x). When we pass generated data as input, ŷ equals D(G(z)), because obviously we first pass the noise through our generator, it produces something, and then we feed that fake data to D. And if you substitute these into the function, you get ln(1 − D(G(z))). Now let's combine them: I have just added them together, and we get this. Does it look similar to the value function? Yes, but we are missing the capital E's at the front. Well, those are just expectations. Understand that this expression is valid for only one data point, but we have to do this over the entire training data set, right? And to represent that mathematically, we need expectations. Expectation is just the average value of an experiment if you perform it a large number of times. Suppose you play a game where you roll a die and your score is the number on the upper face; if you play this game for a really long time, the expected score is 3.5. The formula is very simple: you just add all the possible outcomes, each multiplied by its probability — so it's a kind of weighted mean. So let's apply the expectation to this equation: look, we are adding all the scores weighted by their probabilities. The same goes here, but this is only true for a discrete distribution; if we assume that p_data and p_z are actually continuous distributions, then an integral sign replaces the summation and we have to place dx and dz accordingly, and this whole thing is written in short form as E. So now you know the value function for a GAN. Does it look intimidating now? I don't think so.
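The die-roll example above is a one-liner to verify — the expectation is just each outcome weighted by its probability:

```python
# Expected score of one fair die roll: sum of (outcome * probability).
outcomes = [1, 2, 3, 4, 5, 6]
expected = sum(x * (1 / 6) for x in outcomes)
print(expected)  # 3.5 — the weighted mean of all possible outcomes
```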
I'm going to tell you how we optimize this function in practice. This is our big training loop, and just like for every other neural network, we optimize the loss function with a stochastic method — I am using stochastic gradient descent here. Okay, so first we enter the outer training loop and freeze the parameters of G. Then we enter the inner loop for D. This loop runs for k steps, and in it we first take m data points from the original distribution and m data points from the fake data, and then we update the parameters of our discriminator by gradient ascent. Why ascent? Because remember, our discriminator is trying to maximize the value function. After we have performed k updates of D, we exit this loop and freeze the parameters of D. Now we are going to train our generator: in this case we take only m fake data samples and update the parameters of our generator by gradient descent — because, remember, the generator is trying to minimize the value function. Now you might ask why I haven't included the first term in the generator's update step. Well, look closely: does that expression contain any term involving the generator? No. So the partial derivative of that term with respect to theta_G is zero; that's why we take only the second term. One important thing you should note: for every k updates of the discriminator, we update the generator once. If you have understood the video so far, then you know the value function for a GAN and how we optimize it in practice. But if you are like me and want to know what guarantees that our generator will actually replicate the original distribution, then take a deep breath and keep watching. Okay, just to be clear: we want to prove that p_g will converge to p_data if our generator finds the global minimum of the value function. In other words, we want to show that
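The alternating scheme just described can be sketched in skeleton form. Everything here — the gradient callables, batch size, learning rate — is an illustrative stand-in, not the video's actual code; the point is the structure: k gradient-ascent steps on D with G frozen, then one gradient-descent step on G with D frozen:

```python
def train_gan(num_iters, k, m, sample_data, sample_noise,
              grad_D, grad_G, theta_D, theta_G, lr=0.01):
    """Alternating minimax optimization of the GAN value function.
    grad_D / grad_G are assumed to return gradients of V w.r.t. the
    discriminator's and generator's parameter lists, respectively."""
    for _ in range(num_iters):
        # --- k steps of gradient ASCENT on D (G is frozen) ---
        for _ in range(k):
            x_real = sample_data(m)      # m samples from p_data
            z_fake = sample_noise(m)     # m noise samples, fed through G
            g = grad_D(theta_D, theta_G, x_real, z_fake)
            theta_D = [p + lr * gi for p, gi in zip(theta_D, g)]  # ascend
        # --- one step of gradient DESCENT on G (D is frozen) ---
        # Only the ln(1 - D(G(z))) term matters: the ln D(x) term has
        # zero gradient w.r.t. theta_G.
        z = sample_noise(m)
        g = grad_G(theta_D, theta_G, z)
        theta_G = [p - lr * gi for p, gi in zip(theta_G, g)]      # descend
    return theta_D, theta_G
```

A quick smoke test with zero gradients shows the k-to-1 update structure leaves parameters untouched when there is nothing to learn.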
p_g is equal to p_data at the global minimum of the value function. Okay, this is a two-step process. First of all, we fix G and ask: for which discriminator is the value function maximal? Look here — I have replaced G(z) with x; we can do this because both live in the same domain. Now, if you differentiate, you will see that the maximum of this expression occurs when D(x) attains p_data(x) / (p_data(x) + p_g(x)). Obviously one can differentiate to obtain this, but let us look at it intuitively. We can represent our value function in the form a ln x + b ln(1 − x), and we want to find the value of x for which this expression is maximal. So if I take a = 0.45 and b = 0.6, you will see that the graph looks something like this, and the maximum occurs at 0.429, which is nothing but a / (a + b). Now let's fix D as this optimal discriminator and substitute it into our value function. After fixing D and substituting, we get this, and after a little rearranging we get this long expression, where "min over G" just says that G will try to minimize it. Now, understand what we want to do here: we want to prove that the probability distribution of the generator will be exactly the same as the probability distribution of the data, so it makes sense to talk about ways of measuring the difference between two distributions — and one of the most famous is the JS divergence, the Jensen–Shannon divergence. Now, if you look at the formula for the JS divergence, it looks surprisingly close to our long expression, doesn't it? Just as a refresher: the E here represents the expectation; in the first term we take the expectation under the first distribution, and in the second term under the second distribution.
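The claim that a·ln(x) + b·ln(1 − x) peaks at x = a/(a + b) is easy to check numerically with the video's values a = 0.45, b = 0.6:

```python
import math

def f(x, a=0.45, b=0.6):
    """The fixed-G value function shape: a*ln(x) + b*ln(1 - x)."""
    return a * math.log(x) + b * math.log(1 - x)

# Brute-force search for the maximizer over a fine grid of x in (0, 1).
xs = [i / 10000 for i in range(1, 10000)]
x_star = max(xs, key=f)

print(round(x_star, 3))               # ≈ 0.429
print(round(0.45 / (0.45 + 0.6), 3))  # a / (a + b) = 0.429 — they agree
```

With a = p_data(x) and b = p_g(x), this is exactly the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x)).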
Okay, now let's see if we can somehow get to the JS divergence from this expression. After a little manipulation we get this. So what have we done here? We have multiplied by two inside these two logarithms, and to compensate we subtract 2 ln 2. And if we look closely, this whole portion is actually equal to two times the JS divergence between p_data and p_g, and obviously we have the −2 ln 2 left over. So G wants to minimize this. What is the minimum value of this expression? Well, the JS divergence between any two distributions cannot be negative; the minimum it can reach is zero, and it reaches zero only when the two distributions are equal. That is, only if p_data = p_g will that term be zero, and the whole expression attains its minimum, −2 ln 2. So voilà — now we have proved that at the global minimum of our value function, p_g will be exactly the same as p_data, and our generator is actually trying to attain that state. Now let me show you how G achieves that state — the different phases of training. At the beginning, neither the discriminator nor the generator knows what it is doing, so p_g is not replicating p_data and the discriminator is not classifying well. After updating theta_D — that is, once the discriminator has learned something — the classifier gets better, so the discriminator can actually distinguish between the real data and the fake data. Then, after the generator has learned something, look at that: the distribution p_g is now closer to p_data, and the discriminator is still trying to predict the true labels of the data points, but it is not performing as well. Finally, when the generator has attained the minimum of the value function, it has successfully replicated the distribution function of the data, so p_g is indistinguishable from p_data.
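A quick numeric sanity check of this argument, using two small discrete toy distributions (the numbers are invented):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.2, 0.5, 0.3]
p_g    = [0.4, 0.4, 0.2]   # invented generator distribution

# The minimized objective is C(G) = -2 ln 2 + 2 * JSD(p_data || p_g).
print(js(p_data, p_g) > 0)                       # True: JSD never negative
print(js(p_data, p_data))                        # 0.0 when p_g = p_data
print(-2 * math.log(2) + 2 * js(p_data, p_data)) # the minimum, -2 ln 2
```

Since the JSD term is zero exactly when p_g matches p_data, the objective bottoms out at −2 ln 2 ≈ −1.386 — the same value a 0.5-everywhere discriminator produces.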
So now it is impossible for the discriminator to tell which data point is an original one and which is a generated one, so the discriminator will output 0.5 for every input — and that is what we want to achieve. Well, this is a very simplified view of GANs; in reality, training a GAN is really hard. The goal of this video was to help you understand GANs, and I hope you are now very comfortable with the concept. And if you have understood everything I have talked about in this video, then do congratulate yourself, because now you know the math behind one of the finest inventions in AI. I hope you have liked this video. Please share it and subscribe to my channel. Stay safe, and thanks for watching. [Music]