Transcript:

Unknown: Yeah, in this last video of this lecture, I want to give you a brief teaser of PyTorch convenience functions, namely the linear or fully connected layer, and I would also like to wrap up the conventions regarding linear algebra. So what do I mean by a fully connected or linear layer? If we have a multilayer network like the one shown here, this part here would be a linear transformation if we don't consider the activation function. You can also think of this as a fully connected layer, and sometimes people also call it a dense layer. In PyTorch it's called a linear layer because it's a linear transformation, and in the context of Keras and TensorFlow people call it a dense layer. So all these terms are equivalent. Another dense or fully connected layer would be from here to here. So a neural network really is a concatenation of multiple of these fully connected layers, which are interspersed with nonlinear activation functions, and you can do this many, many times. But multilayer networks are a topic for a different lecture next week. Here I just want to outline already how that relates to linear algebra.
How we implement a fully connected layer in PyTorch is very simple: there's actually a class called torch.nn.Linear. So let's start with the dataset here. I'm just creating some random data, but let's assume this is some valid training data. So this is a design matrix with dimensionality, what do I have, 10 times 5, so a 10 times 5 input matrix. When we initialize this linear layer, we give it the number of input features and the number of output features. Here we have five input features, and that's our design matrix, the n times m matrix. And let's say we want three output features. The output features, if I go back one slide,
are these that I highlighted here. You can think of them as the net inputs. So that's basically the number of outputs, and they are computed by this linear transformation. All right, so each of these linear layers, when you initialize it, has attached as attributes a weight matrix and a bias vector. The weight matrix, you can see it here, is h times m, or in this case three times five dimensional, so you have three rows and five columns. You can see that this five here matches this five, which is the number of input features. And the size of the bias vector is equal to the number of output features. In this case, it's a vector with three values, because each output has a bias unit attached to it. So for the example on the next slide, I would modify this figure to have five inputs and three outputs. And notice that these are small random values. We will talk later about why it's useful to have random values here instead of all zeros.
Now here I'm printing the dimensions of the input, the weight, and the bias. So it's 10 by 5, 3 by 5, and 3. Now I'm applying this fully connected layer to X, which is 10 times 5, with W, which is 3 times 5. There's also plus b, so this is a matrix multiplication plus b, which is three-dimensional. And the output A is a 10 times 3 dimensional matrix. So how does that work? What is going on here? You can probably already see that there must be some transpose for the W, so that the five goes in front and the three goes here, and then it's compatible with X. But is that really what's going on? Yeah, that's what's going on. I have a summary of that. I also have to stare at it for a few seconds, so let me see what I've written down here. Based on PyTorch, we have another convention here.
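The steps described above can be sketched as follows. This is a minimal illustration, not the code from the slide: the seed and the variable names `X`, `fc_layer`, `A`, and `A_manual` are assumptions chosen for this example, and the last lines simply check that applying the layer gives the same values as writing out X times W transpose plus b by hand.

```python
import torch

torch.manual_seed(123)  # arbitrary seed so the random values are repeatable

# Random stand-in for training data: n = 10 examples, m = 5 features
X = torch.rand(10, 5)

# Fully connected layer with 5 input features and 3 output features
fc_layer = torch.nn.Linear(in_features=5, out_features=3)

print(X.size())                # torch.Size([10, 5])
print(fc_layer.weight.size())  # torch.Size([3, 5])
print(fc_layer.bias.size())    # torch.Size([3])

# Applying the layer computes the net inputs for all 10 examples
A = fc_layer(X)
print(A.size())                # torch.Size([10, 3])

# The same computation written out by hand: X times W transpose, plus b
A_manual = X.matmul(fc_layer.weight.t()) + fc_layer.bias
print(torch.allclose(A, A_manual))
```

The weight is stored as a 3 by 5 matrix, so the transpose is what makes the inner dimensions (5 and 5) line up with the 10 by 5 input.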
Recall that in the last video I mentioned the convention where W, the transformation matrix or weight matrix, is in front of x. In PyTorch, it's after X. I think this makes sense not from a geometry perspective, but from a data flow perspective, because this way we have to use fewer transposes, and it also kind of symbolizes the way the data flows through the network. We start with X, then we multiply it by a weight matrix W, and then we get A. So we have X times W resulting in A, and so on. It's more like a linear flow. Actually, I'm always writing A, but it's the net input, so it depends on whether we apply the activation function or not. Actually, I have it here. All right.
So if we have an input with one training example, like this vector x here, then we can use the notation where x is in front. If we transpose W, the dimensions will match. In this case, x is a 1 times m dimensional vector, and W transpose is an m times h dimensional matrix, so the result will be a 1 times h dimensional vector. And if we have n inputs, we can also keep X in front and also transpose W. In this case we have an n times m dimensional matrix here, the same m times h dimensional matrix here, and the result is an n times h dimensional matrix. So what is nice about this convention is that we can keep the same operations whether we have multiple data points or only one. This is actually quite convenient from a computational perspective, because in code we don't have to change much. And if you don't believe me that this is the way PyTorch does it, here is the source code if you want to look at it. So this is the common convention in PyTorch for how the linear transformation happens. So, just to conclude, there are multiple different ways we can compute this linear transformation.
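To make the point about keeping the same operation for one or many data points concrete, here is a small sketch. It uses NumPy rather than PyTorch purely for illustration, and the helper name `net_input`, the seed, and the sizes m = 5, h = 3, n = 10 are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

m, h, n = 5, 3, 10              # input features, output features, examples
W = rng.normal(size=(h, m))     # weight matrix stored as h x m
b = rng.normal(size=h)          # one bias value per output


def net_input(x, W, b):
    """Compute x W^T + b for x of shape (1, m) or (n, m)."""
    return x @ W.T + b


x_single = rng.normal(size=(1, m))  # one training example as a row vector
X_batch = rng.normal(size=(n, m))   # design matrix with n examples

print(net_input(x_single, W, b).shape)  # (1, 3)
print(net_input(X_batch, W, b).shape)   # (10, 3)

# Each row of the batch result matches the result for that example alone
first_row_alone = net_input(X_batch[:1], W, b)
print(np.allclose(net_input(X_batch, W, b)[:1], first_row_alone))
```

The function body is identical in both calls; only the number of rows in the input changes.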
What's really important is to always think about how the dot products are computed when writing and implementing matrix multiplications, because it's easy to make mistakes. Things may compute because the dimensions match, but they are not computing what you wanted them to compute. So it's always important to write down what the dimensions are, what you're computing, and what you expect the output to be. Also, theoretical intuition and convention, like having W, the transformation matrix, in front, do not always match up with practical convenience when we write things in code.
So here I've written down some rules that you might find useful when you're reading textbooks and things are not the same as, let's say, in code, so you can easily translate between those concepts. For example, if you have two matrices A and B, multiplying them is the same as B transpose times A transpose and then taking the outer transpose. So that's the same thing, and these here are also the same. So I was just writing down some rules here that may help you navigate switching between code and textbooks. You probably won't read many textbooks, because there are not many deep learning textbooks yet, but maybe sometime in the future. But also in papers people use all kinds of different conventions, so I think this is a handy thing to keep in mind.
And also, just to summarize traditional versus PyTorch conventions: there are multiple ways we can compute this linear transformation, like I mentioned. We can have the weight matrix up front. This results in an h times 1 dimensional vector when x is an m times 1 dimensional feature vector. Using the rule that I showed you on the previous slide, this is the same as writing it like this, so for the same input it gives us the same output. There is another way we can write it.
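The transpose rule mentioned above, and its use for rewriting the traditional form, can be checked numerically. This is only a sketch with random matrices; the shapes and the names `z_traditional` and `z_rewritten` are assumptions for illustration, and NumPy is used for convenience.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Rule: A B equals the transpose of (B^T A^T)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(5, 2))
print(np.allclose(A @ B, (B.T @ A.T).T))

# Applied to the linear transformation with the weight matrix up front
W = rng.normal(size=(3, 5))    # h x m weight matrix
x = rng.normal(size=(5, 1))    # m x 1 column feature vector

z_traditional = W @ x          # h x 1 result
z_rewritten = (x.T @ W.T).T    # same values via the transpose rule

print(z_traditional.shape)     # (3, 1)
print(np.allclose(z_traditional, z_rewritten))
```

Writing down the shapes next to each line, as in the comments here, is the habit the lecture recommends for catching products that run but compute the wrong thing.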
That is how PyTorch does it: putting X in front, but here we are still transposing W. This assumes a different input, though, namely a 1 times m dimensional input. And this is actually my preferred representation, because it is easy to go back and forth between one and multiple training examples. Here at the bottom are the cases where we have n training examples. If we want to do it the traditional way, we have to use two transposes, which is more work. And this is the usual case in deep learning: we usually have many inputs and many outputs. With the PyTorch convention we only have to use one transpose. It's shorter, and this is also the way PyTorch implements it. So that's the PyTorch convention.
So just to sum it up for this lecture, here is a little ungraded homework experiment you can do if you have extra time. Revisit the perceptron NumPy code, and without even running the code, just thinking about it: can you tell if the perceptron could predict the class labels if we feed it an array of multiple training examples at once? (I'm actually underlining it in an unfortunate way.) So if we have a design matrix of dimensionality n times m for testing, would it be able to run after training? If yes, why? If not, what change would we have to make to the code? You can think about this, and then you can actually run the code with a design matrix as input for prediction and see whether your intuition was correct. Also feel free to open a discussion on Piazza about that. So, run it and verify your intuition. And then how about the train method? Can we also have some parallelism through matrix multiplications in the train method, with multiple training examples at once? Does that make sense without, let's say, fundamentally changing the perceptron learning rule?
So that's another thing to think about. All right, in the next lecture we will talk about more of a deep learning topic that is not fundamental linear algebra: a better learning algorithm for neural networks. We learned about the perceptron rule last week, but this is actually not a very good learning rule, and we will develop a better learning rule next lecture.
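As a possible starting point for the homework experiment mentioned above, here is a hypothetical minimal perceptron in NumPy. It is not the course implementation: the class layout, the method names `forward` and `train`, the zero initialization, and the toy logical-AND dataset are all assumptions made for this sketch. Whether the course code behaves the same way depends on how its own prediction method indexes or reshapes the input.

```python
import numpy as np


class Perceptron:
    """Hypothetical minimal perceptron; not the course implementation."""

    def __init__(self, num_features):
        self.weights = np.zeros((num_features, 1))  # m x 1 weight vector
        self.bias = 0.0

    def forward(self, X):
        # X may be (1, m) or (n, m); the product is (1, 1) or (n, 1)
        net_input = X @ self.weights + self.bias
        return np.where(net_input > 0.0, 1, 0)

    def train(self, X, y, epochs=20):
        # Classic perceptron rule: update after each single example
        for _ in range(epochs):
            for i in range(X.shape[0]):
                x_i = X[i].reshape(1, -1)
                error = y[i] - self.forward(x_i).item()
                self.weights += error * x_i.T
                self.bias += error


# Toy dataset: logical AND, which is linearly separable
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])

ppn = Perceptron(num_features=2)
ppn.train(X, y)

# Feed the whole n x m design matrix to the prediction step at once
preds = ppn.forward(X)
print(preds.shape)   # (4, 1)
print(preds.ravel())
```

In this sketch the prediction step is a single matrix product, so it accepts one row or many rows without changes. The train method, by contrast, updates the weights after every example, which is the part of the question about parallelism that is left open.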