Transcript:

In this episode, let’s talk about backpropagation, the core algorithm behind how neural networks learn. After a quick review of what we covered earlier, I’ll first walk through it intuitively, without any formulas, to show what the algorithm is actually doing. If you want to take a serious look at the math inside, I’ll explain the calculus behind all of this in the next video. If you watched the first two videos, or you already have enough background to jump straight into this one, you know what a neural network is and how it feeds information forward. The classic example we consider here is handwritten digit recognition. The pixel values of a digit are fed into the 784 neurons of the network’s first layer. The network I show here has two hidden layers of 16 neurons each, and an output layer of 10 neurons that represents the network’s final choice. I also assume you understand gradient descent from the previous episode, and that what we mean by learning is finding the particular weights and biases that minimize a cost function. As a reminder, to compute the cost of a single training sample, you take the network’s output and the expected output, and add up the squares of the differences between each pair of components. You do this for all of the thousands of training samples and then take the average, which gives the total cost of the network. And as if that were not enough to think about, there is one more point to recall from the previous episode.
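As a rough sketch of the cost calculation just described: the function names and the sample activations below are made up for illustration and do not come from the video.

```python
def sample_cost(output, target):
    """Sum of squared differences between the network output and the expected output."""
    return sum((o - t) ** 2 for o, t in zip(output, target))


def total_cost(outputs, targets):
    """Average of the per-sample costs over all training samples."""
    costs = [sample_cost(o, t) for o, t in zip(outputs, targets)]
    return sum(costs) / len(costs)


# Hypothetical activations of the 10 output neurons for an image of a 2.
output = [0.5, 0.8, 0.2, 0.1, 0.0, 0.3, 0.0, 0.1, 0.2, 0.0]
# Expected output: only the neuron for the digit 2 (index 2) should be active.
target = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

cost = sample_cost(output, target)
average = total_cost([output, target], [target, target])
```

A perfect output gives a cost of zero, so averaging this sample with a perfect one halves the cost.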
What we are looking for is the negative gradient of the cost function. It tells you how to change the weights and biases on all the connections so that the cost decreases as quickly as possible. Backpropagation, the focus of this episode, is the algorithm used to compute this complicated gradient. There is one idea from the previous episode that I hope you will keep in mind. After all, a gradient vector with 13,000 dimensions is, without exaggeration, impossible to picture. So here is another way to think about it: the magnitude of each component of the gradient vector tells you how sensitive the cost function is to each parameter. For example, suppose you go through the process I’m about to describe and compute the negative gradient, and the component for the weight on this edge comes out to 3.2 while the component for this other edge comes out to 0.1. You can read this as saying that the cost function is 32 times more sensitive to the first weight. If you nudge the first weight slightly, the change in the cost is 32 times larger than the change caused by the same nudge to the second weight. Personally, when I first started learning backpropagation, I found the most confusing part to be all the notation, with its superscripts and subscripts. But once you have a clear idea of what the algorithm does, each step is actually quite intuitive. In fact, it is just a lot of small adjustments carried out one after another.
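To make the 3.2 versus 0.1 comparison concrete, here is a small sketch. The cost function is a hypothetical stand-in, chosen only so that its negative gradient has those two components; a real network's cost is far more complicated.

```python
def toy_cost(w1, w2):
    # Hypothetical cost function (an assumption for illustration): its negative
    # gradient has components 3.2 for w1 and 0.1 for w2, as in the example.
    return -3.2 * w1 - 0.1 * w2


def cost_change(w1, w2, nudge_w1=0.0, nudge_w2=0.0):
    """Change in the toy cost when the weights are nudged by the given amounts."""
    return toy_cost(w1 + nudge_w1, w2 + nudge_w2) - toy_cost(w1, w2)


nudge = 1e-3
change_from_w1 = cost_change(0.5, 0.5, nudge_w1=nudge)
change_from_w2 = cost_change(0.5, 0.5, nudge_w2=nudge)

# The same small nudge moves the cost about 32 times more through w1 than w2.
ratio = change_from_w1 / change_from_w2
```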
So I’ll set all notation aside at the start and explain step by step how each training sample affects the adjustments to the weights and biases. Because the cost function averages the cost over thousands of training samples, the way we adjust the weights and biases in each step of gradient descent also depends, in principle, on every training sample. For computational efficiency, though, we’ll use a trick later so that you don’t have to process every training sample at every step. For now, we focus on just one training sample, this image of a 2. How does this one training sample affect the adjustments to the weights and biases? Suppose the network is not yet well trained, so the activations in the output layer look pretty random, maybe 0.5, 0.8, 0.2, and so on. We can’t change these activations directly; we can only change the weights and biases. But it helps to keep track of how we want the output layer to change. Since we want the image to be classified as a 2, we want the third output value to go up and all the others to go down. The size of each change should be proportional to how far the current value is from its target value. For example, increasing the activation of the “2” neuron matters more than decreasing the activation of the “8” neuron, because the latter is already very close to its target. Now let’s go a step further and focus on this one neuron, whose activation we want to increase. Remember that its activation is computed by taking the weighted sum of all the activations in the previous layer, adding a bias, and passing the result through a squishing function such as sigmoid, or through ReLU.
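The two ideas in this passage can be sketched as follows. This is a minimal illustration assuming a sigmoid squishing function; the function names and the activation values are invented for the example.

```python
import math


def sigmoid(z):
    """Squishing function that maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_activation(weights, prev_activations, bias):
    """Weighted sum of the previous layer's activations, plus a bias, then squished."""
    z = sum(w * a for w, a in zip(weights, prev_activations)) + bias
    return sigmoid(z)


def wanted_output_changes(output, target):
    """How far each output activation is from its target (positive means 'go up')."""
    return [t - o for o, t in zip(output, target)]


# Hypothetical output activations of a poorly trained network for an image of a 2.
output = [0.5, 0.8, 0.2, 0.1, 0.0, 0.3, 0.0, 0.1, 0.1, 0.0]
target = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
changes = wanted_output_changes(output, target)

# The "2" neuron is far from its target, the "8" neuron is already close,
# so the wanted change for "2" is much larger in size.
change_for_2 = changes[2]
change_for_8 = changes[8]
```

Raising the bias or the weights on active inputs raises the neuron's activation, which is why those are the knobs discussed next.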
So there are three main ways to increase this activation: one, increase the bias; two, increase the weights; or three, change the activations of the previous layer. Let’s first look at how to adjust the weights. Each weight has a different amount of influence. The weights on the connections to the brightest neurons of the previous layer have the most influence, because those weights get multiplied by large activation values. So, at least for this training sample, increasing one of those weights has a much bigger effect on the final cost function than increasing the weight of a connection to a dimmer neuron. Remember that when we talk about gradient descent, we don’t just care whether each parameter should go up or down; we care which changes give the most bang for the buck. By the way, this is a little reminiscent of a theory from neuroscience about how biological networks of neurons learn. Hebbian theory is often summed up as “neurons that fire together wire together.” Here, the biggest increases in weights, the biggest strengthening of connections, happen between the neurons that are most active and the neurons we want to become more active. In a sense, the neurons that fire when seeing a 2 become more strongly linked to the neurons that fire when thinking about a 2. To be clear, I’m in no position to say whether artificial neural networks really mimic how biological brains work, and “neurons that fire together wire together” comes with an asterisk. But as a loose analogy, I find it interesting. Back to the topic: the third way to increase this neuron’s activation is to change the activations of the previous layer. More specifically, if every neuron connected with a positive weight got brighter and every neuron connected with a negative weight got dimmer, the “2” neuron would become more active. As with the weights, to get the biggest effect, each activation should be changed in proportion to the size of its corresponding weight. Of course, we can’t change those activations directly; we only have control over the weights and biases. But just as with the last layer, it helps to keep a note of the changes we’d like.
Keep in mind, though, that zooming out, this is only what the “2” neuron wants. We also want the remaining neurons in the last layer to become less active, and each of those output neurons has its own opinion about how the second-to-last layer should change. So we add the wishes of the “2” neuron to the wishes of all the other output neurons to get an instruction for how the second-to-last layer should change. Each desired change is in proportion to the corresponding weight, and also in proportion to how much the activation of each output neuron needs to change. This is where the idea of propagating backwards comes in. By adding up all these desired effects, you get a list of changes you’d like to make to the second-to-last layer. With those, you can repeat the same process on the parameters that determine the activations of the second-to-last layer, and keep repeating it layer by layer, moving backwards until you reach the first layer. Now look at the bigger picture. Remember that so far we have only discussed how a single training sample wants to nudge all the weights and biases. If we only listened to what that “2” wants, the network would end up classifying every image as a 2. So you go through the same backpropagation routine for every other training sample, record how each one wants to change the weights and biases, and then average those desired changes together. This collection of averaged nudges to each weight and bias is, loosely speaking, the negative gradient of the cost function mentioned in the previous video, or at least something proportional to it. I say “loosely speaking” only because I haven’t yet explained precisely how to quantify these nudges.
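The bookkeeping described above can be sketched for a single layer and a single training sample. This is an illustration under stated assumptions, not the video's own code: it assumes sigmoid activations and the squared-error cost, and the exact scaling factor `2 * (y - a) * a * (1 - a)` comes from the standard chain-rule calculation that the transcript defers to the next video. All names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, biases, prev_activations):
    """Activations of one layer; weights[j][k] links previous neuron k to neuron j."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, prev_activations)) + b)
        for row, b in zip(weights, biases)
    ]


def layer_cost(weights, biases, prev_activations, target):
    """Squared-error cost of this layer's output for one training sample."""
    outputs = forward(weights, biases, prev_activations)
    return sum((a - y) ** 2 for a, y in zip(outputs, target))


def backprop_one_layer(weights, biases, prev_activations, target):
    """Desired nudges (negative gradient of the cost) for one training sample."""
    outputs = forward(weights, biases, prev_activations)
    # How strongly each output neuron "wants" its weighted sum to go up or down.
    signals = [2.0 * (y - a) * a * (1.0 - a) for a, y in zip(outputs, target)]
    # Weight nudges scale with the activation of the neuron the weight connects to.
    weight_nudges = [[s * a_prev for a_prev in prev_activations] for s in signals]
    bias_nudges = list(signals)
    # Each previous-layer neuron adds up the wishes of all output neurons,
    # each weighted by the corresponding connection weight.
    prev_nudges = [
        sum(row[k] * s for row, s in zip(weights, signals))
        for k in range(len(prev_activations))
    ]
    return weight_nudges, bias_nudges, prev_nudges


# Tiny hypothetical layer: two previous neurons (one bright, one dim), two outputs.
weights = [[1.0, -1.0], [-1.0, 1.0]]
biases = [0.0, 0.0]
prev_activations = [0.9, 0.1]
target = [1.0, 0.0]

weight_nudges, bias_nudges, prev_nudges = backprop_one_layer(
    weights, biases, prev_activations, target
)
```

In this toy layer, the weight from the bright neuron gets a nudge nine times larger than the weight from the dim one, and `prev_nudges` is the list of wishes that would be passed one layer further back.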
But if you understood every change I just described, why some are proportionally bigger than others, and how they all get added together, you understand how backpropagation really works. By the way, in practice it takes too long to compute every step of gradient descent using every training sample. So what is usually done instead is this: first, shuffle the training samples and divide them into many minibatches, each containing, for example, 100 training samples. Then you compute a step according to the minibatch. This is not the true gradient of the cost function, which depends on all the training samples rather than this subset, so it is not the most efficient step downhill. But each minibatch gives you a pretty good approximation, and more importantly, it greatly reduces the amount of computation. If you plotted the path your network takes down the cost surface, it would look a bit like a drunk man stumbling aimlessly down a hill but taking quick steps, rather than a careful person who calculates the exact downhill direction before each step and then takes a slow, cautious step in that direction. This technique is called “stochastic gradient descent.” There’s a lot going on here, so let’s summarize.
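A minimal sketch of the minibatch procedure described above. To keep it short and runnable, a one-weight toy model stands in for the network; that model, the learning rate, and all function names are assumptions for illustration only.

```python
import random


def make_minibatches(samples, batch_size, rng):
    """Shuffle a copy of the samples and split it into consecutive minibatches."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


def sgd_epoch(params, samples, nudge_fn, rng, batch_size=100, learning_rate=0.1):
    """One pass over the data, taking one gradient descent step per minibatch."""
    for batch in make_minibatches(samples, batch_size, rng):
        # Each sample reports how it would like to nudge every parameter.
        nudges = [nudge_fn(params, sample) for sample in batch]
        # Average the wishes over the minibatch only, not over the whole data set.
        averaged = [sum(n[i] for n in nudges) / len(batch) for i in range(len(params))]
        params = [p + learning_rate * a for p, a in zip(params, averaged)]
    return params


def toy_nudge(params, sample):
    """Negative gradient of the toy per-sample cost (w * x - y) ** 2 with respect to w."""
    w = params[0]
    x, y = sample
    return [-2.0 * (w * x - y) * x]


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
samples = [(x, 3.0 * x) for x in xs]  # the weight that fits this data is 3

params = [0.0]
for _ in range(20):
    params = sgd_epoch(params, samples, toy_nudge, rng)
```

Each step uses only 100 of the 1,000 samples, so its direction is only approximately downhill, but the weight still settles very close to 3 after a few passes over the data.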
Backpropagation is the algorithm for determining how a single training sample would like to nudge the weights and biases, not just in terms of whether each one should go up or down, but in terms of what relative proportions of those changes cause the cost to decrease fastest. A true gradient descent step would involve doing this for all of your tens of thousands of training samples and averaging the desired changes. But that is too slow, so instead you divide the samples into minibatches and compute each step of gradient descent from one minibatch. Going through the minibatches over and over and adjusting the parameters each time, you eventually converge to a local minimum of the cost function. At that point, your network ends up doing a really good job on the training data. With all that said, every line of code that goes into implementing backpropagation corresponds more or less to something you have now seen. But sometimes knowing what the math does is only half the battle, and just representing the whole thing is where it gets muddled and confusing. So for those who want to go deeper, the next video presents the content of this episode in terms of calculus. I hope that will make the topic easier to digest when you run into it in other resources.
Before closing, I want to highlight one point. For any machine learning to work, including neural networks and backpropagation, you need a lot of training data. The handwritten digit example we use is so convenient because the MNIST database exists, with every sample already labeled by humans. So a difficulty familiar to everyone working in machine learning is simply getting labeled training data, whether that means having people label thousands of images or whatever other kind of data you are dealing with. That makes for a natural segue to today’s sponsor, CrowdFlower, a software platform built for data scientists and machine learning teams to create training data. They let you upload text, audio, or image data and have it labeled by real people. You may have heard of the “human-in-the-loop” approach, and that is essentially what this is: using human intelligence to train machine intelligence. They also use a number of smart quality control mechanisms to keep the data clean and accurate, and they have helped test data for thousands of AI projects. Even better, you can get a free T-shirt this time. Visit 3b1b.co/crowdflower, or follow the link on screen and in the description. After you register a new account and create a new project, you get a free T-shirt, and I think it’s a pretty cool one. Thanks to CrowdFlower for supporting this episode, and thanks also to all the patrons on Patreon for funding this series.