Transcript:

Hey, everyone. This is going to be a series about understanding the intuition behind neural networks visually. I think a lot of blogs, articles, and videos portray this term as some sort of forward-propagating matrix multiplication thing, but as you'll see, there's a lot of deeper intuition we can uncover with some visualizations. So even if you're an expert on neural networks, I think it's still worth sticking around and seeing what I've got for you. For people new to this topic, the series will build foundational knowledge and intuition, one that I think will be really useful if you choose to pursue this topic. In this video we're just going to be looking at the structure of a neural network; in the next couple of videos, I'll show you what a trained network looks like and what the training process looks like. Broadly speaking, the goal of neural networks is to find patterns within data. Maybe you've heard of the classic example of classifying a handwritten digit: given some image of a number, how would one develop an algorithm to classify it as a digit from zero to nine? Neural networks can solve this by looking at a ton of pictures of digits and their labels and finding patterns among them, looking at how ones typically look, how twos look, how threes look, and so on. With this information, the neural network will be able to classify images it has never seen before, which is what makes it useful. A neural network is a layered structure: it has an input layer, hidden layers, and an output layer. Each of these layers consists of many neurons, or the circles. If we're looking at images, the input layer consists of the pixel values of the image. In the case of handwritten digits, we have 10 different categories to classify our input into, which are the digits 0 to 9, so the output layer consists of 10 outputs. The hidden layers and the number of layers are arbitrarily chosen; it's typically just trial and error to see what works best.
Now, the question is, what does the neural network do? To answer this, let's consider a simple neural network with one input and one output. Let's say that the goal of this neural network is to classify some weather as good or bad: an output of 1 means that the weather is good and an output of 0 means that the weather is bad. This type of neuron with only binary outputs is called a perceptron. Think of the perceptron as a switch: if the perceptron is on, it outputs 1, and if it's off, it outputs 0. Let's say that the weather is good when it's above or equal to 20 degrees Celsius. The input space, which is one-dimensional, is just a number line, and what the perceptron is doing is drawing a boundary at 20 degrees: any input above or equal to 20 degrees turns the perceptron on, or active, whereas any input below 20 degrees turns the perceptron off, or inactive. Essentially, what we do when we train a neural network is determine where these boundary lines go. Let's talk about how we do this mathematically. The output value y-hat can be written as Heaviside(x − 20). The Heaviside step function is quite simple: it equals 1 when x is greater than or equal to 0, and 0 when x is less than 0. That is, if x is positive, it's 1, and if x is negative, it's 0. Heaviside(x − 20) is the exact same as the piecewise function that equals 1 when x − 20 is greater than or equal to 0, and 0 otherwise. If we wanted to flip the decisions and have any temperature above 20 be bad, we simply add a negative coefficient to the input, making it Heaviside(−x + 20). In general, the formula for the output of a perceptron with a single input is Heaviside(wx + b). Now, what if I added another input to our question? Let's say I look at both the temperature and the humidity to determine if the weather is bad, and thus we add an extra dimension to our input space, moving to the 2D Cartesian plane.
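The single-input perceptron described above is only a few lines of code. Here is a minimal sketch of the "20 degrees Celsius" example, where the boundary is encoded as w = 1, b = −20 (the function names here are my own, not from the video):

```python
import numpy as np

def heaviside(x):
    """Heaviside step function: 1 when x >= 0, 0 otherwise."""
    return np.where(x >= 0, 1, 0)

def perceptron(x, w, b):
    """Single-input perceptron: Heaviside(w*x + b)."""
    return heaviside(w * x + b)

# "Good weather" boundary at 20 degrees Celsius: w = 1, b = -20
print(perceptron(25, w=1, b=-20))  # 1: at or above 20 degrees, so good
print(perceptron(15, w=1, b=-20))  # 0: below 20 degrees, so bad

# Flipping the decision with a negative coefficient: Heaviside(-x + 20)
print(perceptron(25, w=-1, b=20))  # 0: now anything above 20 is bad
```

Note that training is exactly the process of finding w and b; here they are simply set by hand to match the chosen boundary.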
Let's say that if either the temperature or the humidity is high, the weather is bad, and good otherwise. To do this, we would draw a line through our input space and set any point above the line as 1 and any point below the line as 0. Mathematically, instead of Heaviside(wx + b), we now have two variables, which I'll call x1 and x2. We can write this new model as Heaviside(w1·x1 + w2·x2 + b), which is just the standard notation for a plane, and this becomes a line when the plane intersects the plane z = 0. We'd like to generalize this even more. Instead of looking at each variable separately, what if we took all the w's and put them into a matrix, and all the x's and put them into a vector? Then this multiplication can be written as a matrix multiplication between the W matrix and the x vector, plus b. The values in W are called the weights, and b is called the bias. The case for three dimensions is pretty similar. Let's say I wanted to look at wind speed too. Now we move our input space into three dimensions, and the decision boundary can be represented by a plane, or a hyperplane to generalize. The input vector is now three by one, and the formula for the output of the perceptron remains the same. In general, an artificial neuron takes in some input vector x, multiplies it by a weight matrix, adds a bias to it, and passes it through a function called the activation function. So in this case, the activation function is the Heaviside step function. Let's look at some more activation functions. It would be useful to have a spectrum of outputs instead of one or zero. What I mean is, if the output before the activation function is 0.001, the Heaviside step function will spit out 1.
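The vectorized form Heaviside(Wx + b) can be sketched in a few lines. The specific weights and bias below are made-up illustration values for a hypothetical temperature-and-humidity model, not anything from the video:

```python
import numpy as np

def heaviside(x):
    return np.where(x >= 0, 1, 0)

def neuron(x, W, b):
    """General artificial neuron: activation(W @ x + b)."""
    return heaviside(W @ x + b)

# Hypothetical weights for a two-input (temperature, humidity) model.
W = np.array([[0.5, 0.8]])   # 1x2 weight matrix
b = np.array([-20.0])        # bias

x = np.array([30.0, 10.0])   # temperature = 30, humidity = 10
print(neuron(x, W, b))       # [1]: this point lies on the "1" side of the line
```

Adding a third input like wind speed just means W grows to 1x3 and x to a three-by-one vector; the formula `W @ x + b` is unchanged.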
Even though it's close to 0. So what we can do is take the Heaviside step function and join both ends to get this curve: the sigmoid curve. The sigmoid activation function allows us to convert our binary output into a probability, so if the output is 0.5, there's a 50% chance of the predicted event happening. The reason we can do this is because the sigmoid function always takes on a value between 0 and 1. Another activation function is the rectified linear unit, or ReLU. ReLU has two parts: it equals 0 for negative numbers and the input itself for positive numbers. We can write this as max(0, x). ReLU is used for the hidden layers of a neural network; we'll go over why in part three, so for now, just take it as a given. Another important thing to notice is that none of these functions are linear. This will be important later on. This is great: we now have a model for linearly distinguishing between data. The problem is, what if our data set is a lot more complicated? Take this data set, for example. We want to create a model that can predict new points as red or green, given the x and y coordinates. Well, this is pretty simple: we can use a perceptron with two inputs and place the decision line over here. We call this type of data set linearly separable. Now, what if our data set looks like this? If you notice, we can't really draw one line to separate our data; we need to add some non-linearity and more complexity. This is where we talk about neural networks. The idea is that when we layer neurons together, the non-linearity of the activation functions adds up, and we can model really complex decision boundaries. To understand how neural networks work, let's look at a neural network with two inputs and two outputs. Let's focus on the first output neuron. We know the value of this neuron is going to be this equation, where sigma is the activation function.
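Both activation functions mentioned above are one-liners. A quick sketch, using the standard formulas sigmoid(x) = 1 / (1 + e^(−x)) and ReLU(x) = max(0, x):

```python
import numpy as np

def sigmoid(x):
    """Squashes any real number into (0, 1), so outputs can act like probabilities."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified linear unit: 0 for negative inputs, the input itself otherwise."""
    return np.maximum(0, x)

print(sigmoid(0))                   # 0.5, a 50% chance
print(relu(np.array([-2.0, 0.5])))  # [0.  0.5]
```

Unlike the Heaviside step, sigmoid gives a graded output: an input of 0.001 maps to just barely above 0.5 rather than jumping straight to 1.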
The first digit of the subscript of a weight signifies which neuron we're talking about, and the second digit signifies which input the weight is being multiplied by. Similarly, we can do the same for the second output neuron, but using 2 as the first digit of the subscript. But is there a way to put both of these equations into a single equation? Well, first, let's vectorize our equation, putting our outputs into a vector and our inputs into a vector. What we're missing is some matrix that we can multiply our input by to get the output; we call this the weight matrix. This lets us generalize the equation sigma(Wx + b) to any number of input neurons and any number of output neurons. Furthermore, we can use the same equation for each layer, so each layer has its own weight matrix and its own bias vector. Now, let's add a hidden layer to our 2-2 network from before. The equation for the hidden layer is h = ReLU(Wx + b), where W is the weight matrix and b is the bias vector; notice that we're using ReLU for the hidden layer. I'll go over why it's better in another video, but you can also use other activation functions like the sigmoid. Next, the equation for the last layer is y-hat = sigma(Wh + b), where the input is now the hidden layer's output. If you're fresh off some linear algebra knowledge, especially 3Blue1Brown's fantastic series, you would be extremely eager to call this weight matrix times the input vector a linear transformation, and you would be absolutely right. For now, let's ignore the activation function, which, if you remember, provides non-linearity, and just consider Wx + b. The way we view a matrix multiplied by a vector as a linear transformation is to place the unit vectors at the columns of the matrix. To start, our unit vectors are at (1, 0) and (0, 1); these are also called i-hat and j-hat. The first column in this matrix is (1, 2), so we move the x unit vector i-hat to (1, 2). The second column in this matrix is (2, 1), so we move the y unit vector j-hat to (2, 1).
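The per-layer equations above compose into a single forward pass. Here is a minimal sketch of the 2-2 network with one hidden layer, using randomized weights just like the upcoming visualization (the function and variable names are my own):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2):
    """h = ReLU(W1 @ x + b1), then y_hat = sigmoid(W2 @ h + b2)."""
    h = relu(W1 @ x + b1)
    return sigmoid(W2 @ h + b2)

# Randomized weights and biases: each layer gets its own W and b.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 2)), rng.standard_normal(2)   # hidden layer
W2, b2 = rng.standard_normal((2, 2)), rng.standard_normal(2)   # output layer

y_hat = forward(np.array([1.0, -1.0]), W1, b1, W2, b2)
print(y_hat)  # two outputs, each strictly between 0 and 1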
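```

Because the last activation is sigmoid, every output lands in (0, 1), which is what squishes the points into the unit square in the animation that follows.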
For a transformation to be linear, we can only perform rotation, scaling, shearing, and reflection. This means that our origin will always map to the origin. If you don't get what I'm talking about, I highly recommend you watch 3Blue1Brown's video on linear transformations. Going back to our neural network, let's see if we can visualize what's happening as inputs are being passed through a neural network with two inputs, two outputs, and one hidden layer. This neural network has randomized weights and biases. So first, let's start with our input data set: in this case, it's a uniformly distributed chunk of points in a square. The first transformation is multiplying by the weight matrix, which is a linear transformation; notice how this is some combination of rotation, shearing, and scaling. The next step is adding the bias vector, which has the effect of shifting the points in the direction of the bias term. And now we get to the activation function. If you recall, we used ReLU as the activation function here, which essentially zeroes out any negative component, leaving only the positive ones. Well, the only region that has only positive inputs is the first quadrant, so applying ReLU folds any input from the other quadrants onto the x and y axes, leaving us with this chunk of points in the top right. One thing to notice is that you couldn't get this shape with a simple linear transformation. This is the importance of the activation function: it helps us tackle more complicated decision boundaries. After this, we pass it through yet another matrix multiplication, or a linear transformation, then we add the bias, and finally the sigmoid function squishes everything into the unit square, because, if you remember, it always outputs a value between 0 and 1. We can play this linear transformation game in three dimensions, too.
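The folding behavior of ReLU is easy to verify numerically: a point in each quadrant shows exactly where each one lands. A small sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# One 2D point per quadrant.
points = np.array([
    [ 2.0,  3.0],   # first quadrant: unchanged
    [-2.0,  3.0],   # second quadrant: folded onto the y-axis
    [-2.0, -3.0],   # third quadrant: collapsed to the origin
    [ 2.0, -3.0],   # fourth quadrant: folded onto the x-axis
])

folded = relu(points)
print(folded)
# [[2. 3.]
#  [0. 3.]
#  [0. 0.]
#  [2. 0.]]
```

Every point ends up in the closed first quadrant, which is why no single linear transformation (origin maps to origin, lines stay lines) could produce this shape.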
Let's consider this neural network with two input neurons, a hidden layer with three neurons, and a final layer with two outputs. As before, the weights and biases are randomized. We start off in the 2D plane with a uniformly distributed square of points, but now, instead of mapping from 2D to 2D, we're mapping from 2D to 3D. To do this, we need to rotate into the three-dimensional space. From here, we can perform a linear transformation of the weight matrix times the input and add the bias. Now we need to talk about ReLU. If you recall, it only preserves positive inputs, and in 2D this was the first quadrant, so in 3D this is going to be the first octant. Thus, we fold every other input onto the planes of the axes, revealing this triangle-like shape with folds in the first octant. And now we do another linear transformation back to 2D and add the bias term. Notice how this blob is completely different from anything you could get with a linear transformation. This once again shows us the importance of the activation function, helping us model non-linear decision boundaries. Now, let's take a look at the same animation, except going directly from the first layer to the last layer. To finish off, let's talk about neural networks and classification. Throughout this video I've portrayed neural networks as some sort of computational unit, yet the original example I showed you was classifying handwritten digits. How do we do this? Well, we have the pixels of the image as inputs, some number of hidden layers, and 10 outputs. We assign a digit to each of these 10 outputs, then we calculate the final layer's values and simply choose the digit with the highest value. And so, through neural networks, we can not only model functions, but classify data, too. In the next video, we're going to be looking at what a trained neural network looks like, with some pretty visuals. In the meantime, I'm going to redirect you to some people who helped me in this video.
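The "choose the digit with the highest value" step is just an argmax over the 10 output neurons. Here is a minimal untrained sketch for a 28x28 image flattened to 784 inputs; the hidden size of 16 and all the weights are arbitrary placeholders, since a real classifier would need trained weights:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def classify_digit(pixels, W1, b1, W2, b2):
    """Forward pass, then pick the digit whose output neuron is largest."""
    h = relu(W1 @ pixels + b1)
    scores = W2 @ h + b2           # 10 output values, one per digit 0-9
    return int(np.argmax(scores))  # the digit with the highest value

# Untrained sketch: random weights, so the prediction is meaningless for now.
rng = np.random.default_rng(42)
W1, b1 = rng.standard_normal((16, 784)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((10, 16)), rng.standard_normal(10)

image = rng.random(784)  # stand-in for a real flattened handwritten digit
print(classify_digit(image, W1, b1, W2, b2))  # some digit from 0 to 9
```

Training, covered in the next videos, is what turns these random weights into ones whose highest output actually matches the digit in the image.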
Yann LeCun and Alfredo Canziani. Yann is one of the pioneers of deep learning, and Alfredo is a computer science professor who uploads his lectures to YouTube. These are a fantastic way to learn the concepts I taught in this video more rigorously, and also to learn how to code them up using PyTorch. And a huge thanks to everyone supporting me on Patreon. That's all I have for you today. Thanks for watching.