In this lecture, we are going to talk about the CNN, the convolutional neural network, which is one of the most popular deep learning architectures. CNNs work very well for image processing, NLP, and so on. This is a typical CNN architecture, which takes an image as input. First we do a convolution on this image, which generates many feature maps, and then we have a subsampling step, which reduces the information coming out of the convolution. After that we have a series of layers: convolution and subsampling, convolution and subsampling, and finally all the information from this part is connected to a linear layer, also called a fully connected layer, or sometimes a dense layer. This part is exactly the same as a typical softmax classifier, and it predicts the labels for a given image.

So what is a convolution? Suppose we have an image like this. It has a certain width and height, and in this case it is an RGB color image, so its depth is 3. If we were using a simple softmax classifier, we would just use all the pixel information as the input. But the key idea of convolution is that we look at only a small portion of the image at once. We use a filter which is much smaller than the image, and it looks at only a small region, often called a patch; we do some operations on it and get one value out. Of course, this filter slides around the entire image, so in the end we see the whole image, but we look at only a small portion at a time. That is the key idea of convolution.

Let's see how convolution works with a small image: the width and height are 3 by 3 and the depth is only 1. This is our image. For the filter, we can decide the size; here let's say our filter is 2 by 2, and its depth is usually the same as the image's. Then we do a convolution using this filter.
We look at a small portion of the image. The filter here is 2×2, so we look at only a 2×2 portion of this image, do some computation, and get one number. Then we move the filter to the right. How much we move it is called the stride. In this case the stride is 1, which means we move one step to the right, do the same operation, and get one more number. When there is no more room to go right, we move down; that is another convolution operation, and finally we do the last one here, and this covers the entire image. As a result we get four numbers, a 2 by 2 output. This is the output of our convolution.

Let's see how we compute the convolution operation using this filter. Here the pixel values are 1, 2, 3; 4, 5, 6; 7, 8, 9. As for the weight values, usually these change during the training process, but here let's say we have certain fixed values like this. First, we select a small patch using the filter, and then, using the values inside this patch and the values in the filter, we compute one single value. What is the operation? The operation is a dot product: w dot x. This is how we compute this one value. In this case, in w we have 0.1, 0.5, 0.3, and 0.4, and in x we have 1, 2, 4, 5. The dot product means we multiply these two values, multiply those two values, and so on, and then we add them all up.
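The single-patch dot product just described can be checked in plain Python; this is a minimal sketch using the weight and pixel values from the example:

```python
# Flatten the 2x2 filter and the top-left 2x2 patch of the 3x3 image
# into vectors, then take their dot product, as described above.
w = [0.1, 0.5, 0.3, 0.4]  # filter weights from the example
x = [1, 2, 4, 5]          # pixel values in the top-left 2x2 patch

value = round(sum(wi * xi for wi, xi in zip(w, x)), 10)
print(value)  # 0.1*1 + 0.5*2 + 0.3*4 + 0.4*5 = 4.3
```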
In this example, 0.1 times 1 is 0.1, plus 0.5 times 2 is 1.0, and so on; adding them all up, the value here is 4.3, so 4.3 goes here. Then we just move the window over, and in the same manner, using our filter, we compute w dot x; we often also express this as w transpose times x, a matrix multiplication in this case. By doing that we get this value, and then we move the window again to another area like here, compute this value, and so on.

Also, it is very common to add padding, usually called zero padding because we put zeros on the boundaries of the image. By doing that, we can change the size of the output feature map. For example, here, with the same 2 by 2 filter, first we look at this part of the image, then this part, and lastly this one, so we get three values across, and likewise three going down, so we can get a 3 by 3 feature map. This animation shows how convolution works in action: using a filter of this size, we move around the entire image, compute one value at each position, and collect the results into a feature map. Of course, we can think of various amounts of padding, and different padding generates different sizes of feature maps. As for the stride, in our previous example we used a stride of 1, which means we move one step at a time, but we could move two steps, or three steps; that is the stride.

How about images with a certain depth? For example, a color RGB image whose depth is 3. The basic idea is the same.
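The whole sliding-window operation, including the stride and zero padding just described, can be sketched in plain Python; the image and filter values are the ones from the example, and everything else is only a naive illustration:

```python
def conv2d(image, kernel, stride=1, pad=0):
    """Naive 2-D convolution (really cross-correlation, as in deep
    learning) of a single-channel image with a small filter,
    with optional zero padding and stride."""
    if pad:
        width = len(image[0]) + 2 * pad
        padded = [[0.0] * width for _ in range(pad)]
        for row in image:
            padded.append([0.0] * pad + list(row) + [0.0] * pad)
        padded += [[0.0] * width for _ in range(pad)]
        image = padded
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, h - kh + 1, stride):
        row = []
        for j in range(0, w - kw + 1, stride):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[0.1, 0.5],
          [0.3, 0.4]]

fmap = conv2d(image, kernel)           # 2x2 feature map; top-left value is 4.3
bigger = conv2d(image, kernel, pad=1)  # one ring of zeros enlarges the output
```

With no padding, the 3 by 3 image and 2 by 2 filter give a 2 by 2 feature map; adding a ring of zero padding makes the output larger, which is exactly the effect described above.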
The only difference is that we use a filter with the same depth. Again, using this filter, we look at only a small portion of the image and do a dot product; we just have a few more numbers in the operation. Instead of just 5 by 5, it is 5 by 5 times 3, so a 75-dimensional dot product, and we add everything up to get one single value. By applying this, we generate one number.

So here is a bigger image, 32 by 32, and the filter is 5 by 5. Let's think about what the output will be. We start from here with stride 1 and move one step at a time. How many steps can we go across? That is 28. The same going down, one step at a time: also 28. So 28 by 28 is the size of our output activation map. Of course, we can apply multiple filters. Different filters, usually with different values, generate different activation maps, and we can have many, many filters. How many filters we use decides the depth of the generated activation maps. So if we apply 6 filters, we create activation maps with depth 6, while the 28 is decided by the original image size and the filter size. Similarly, we can apply another convolution layer here with 10 filters; if we use 10 filters, the output will have depth 10, and here a 5 by 5 filter applied to the 28 by 28 maps generates 24 by 24 activation maps.

Then, often, we apply a pooling or subsampling layer after each convolution layer. The idea is that we want to reduce the amount of information generated by the convolution layers. How do we do that? Suppose we have some feature map or activation map generated by the previous convolution layers; then we use exactly the same filter idea.
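The output sizes worked out above (32 to 28, then 28 to 24) all follow one formula; a quick sketch of that arithmetic:

```python
def conv_out_size(n, k, stride=1, pad=0):
    """Spatial output size of a convolution: (n - k + 2*pad) // stride + 1."""
    return (n - k + 2 * pad) // stride + 1

# 32x32 image, 5x5 filter, stride 1, no padding -> 28x28 activation map;
# six such filters stack into a 28x28 volume with depth 6.
first = conv_out_size(32, 5)
# a second 5x5 convolution on the 28x28 maps -> 24x24; ten filters -> depth 10
second = conv_out_size(first, 5)
print(first, second)  # 28 24
```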
Here the filter size is 2 by 2, which means we look at a 2 by 2 area of the map at once. Max pooling is very commonly used, which means that within this patch we select the maximum value, which is 6, so we copy that, or use it, as the output. Then we move this window, and how much we move is defined by the stride. The stride here is 2, so two steps: we get this window, take the maximum value, and use that as the output, and so on; here is the maximum value, and this is the maximum value. After this max pooling, we have this output. This animation shows how max pooling works in action: in each area, we select the maximum value and put it into the output. There are also other operations, like average pooling, which computes the average within each patch.

Now we understand how the convolution works in the CNN, and what the subsampling is, and this last part is exactly the same as our previous softmax classifier. So what is the main difference between a fully connected neural net and a CNN, which is a locally connected neural net? With a fully connected neural net, given an image, we read all of the pixels and feed them in as our input. In a CNN, however, we use small filters, and all the weights are shared; using these small filters we still look at the whole image, but as a result we have far fewer weights, and it is much more flexible for handling images.

Let's try to implement this. In our example, we are going to use the MNIST dataset with two convolution layers and one fully connected layer. For the convolution layer, we can just use the API called Conv2d, which takes in_channels, out_channels, and kernel_size. In our example the MNIST images have one color channel, so the in_channels size is 1.
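The max pooling described above can also be sketched naively in plain Python; note the 4 by 4 input values here are hypothetical, since the example only shows that the first patch's maximum is 6:

```python
def max_pool2d(fmap, size=2, stride=2):
    """Naive 2-D max pooling: keep the maximum of each size x size patch."""
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        out.append(row)
    return out

# Hypothetical 4x4 activation map (only the first patch's max, 6,
# comes from the lecture's example).
fmap = [[1, 6, 2, 3],
        [5, 4, 1, 0],
        [1, 2, 8, 7],
        [3, 0, 4, 9]]
pooled = max_pool2d(fmap)
print(pooled)  # [[6, 3], [3, 9]]
```

With filter size 2 and stride 2, the 4 by 4 map shrinks to 2 by 2: each output is the maximum of one non-overlapping patch.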
As for the out_channels, you can decide how many outputs you want to generate; here, let's say we want to generate 10 channels. You can also decide the kernel size; in this case our kernel size is 5 by 5. The max pooling is even simpler: we just provide the kernel size, that is, how big an area we want to look at at once for the pooling. And how do we implement this last part? Exactly like our softmax classifier: we just use one linear layer.

Let's get into the implementation. First we define our class, and in its init we define all the components that we need. Here conv1, for example, takes 1 in channel and generates 10 out channels with kernel size 5. The second convolution uses 10 in channels, because the first one generates 10 channels, so these two values must be the same; it generates 20 out channels, and the kernel size is again 5. Then we define the max pooling, and we define the fully connected layer. In forward, we basically connect them all together: x is the input to the first convolutional layer, then we use the max pooling and then the ReLU, and this output is used as the input to the second layer, and so on. At this point, what we have is some activation maps with 20 channels, and we want to flatten them to feed them into the linear layer. This is how we flatten the tensor: with view, where the first dimension is basically N, the batch size, and the rest will be computed automatically. So we flatten it, feed it to our fully connected layer, and this creates the final output, which we pass through a log softmax.

So what is the right value here? The question is: what is the size that we get after we finish all these convolution layers? This is a big question, and obviously it depends on the given image size and the filter sizes.
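One way to answer that question is to trace the shapes by hand; here is a quick sketch of that arithmetic for this network, assuming the 28 by 28 MNIST inputs used in this example:

```python
# Shapes through the network described above, layer by layer:
# Conv2d(1, 10, kernel_size=5) -> MaxPool2d(2) ->
# Conv2d(10, 20, kernel_size=5) -> MaxPool2d(2) -> flatten
def conv_size(n, k):    # convolution, stride 1, no padding
    return n - k + 1

def pool_size(n, k=2):  # max pooling with stride equal to kernel size
    return n // k

n = 28                            # MNIST images are 28x28
n = pool_size(conv_size(n, 5))    # conv1 then pool: 28 -> 24 -> 12
n = pool_size(conv_size(n, 5))    # conv2 then pool: 12 -> 8 -> 4
flat = 20 * n * n                 # 20 channels of 4x4 maps
print(flat)                       # 320
```

So the flattened tensor has 320 features per image, which is the input size the linear layer needs.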
You can compute it by hand, one layer at a time; however, if you are not sure, just put in any random value like this, execute the program, and see what it complains about. It says here: what I was fed is a 64 by 320 matrix, which you tried to multiply with your weights, which are 100 by 10, and that is a mismatch, right? Which means that after flattening, what we get is this: the 64 looks like the batch size N, and here is our flattened tensor size, 320. So what is the right value here? These two numbers must be equal, so the right number for this layer is 320. So we do not really have to compute it; just put in any number, run it, and it will give you the numbers. Also, if you want to know the size of this tensor, just print out x.size() here, and it will give you the sizes of these tensors, so you can check that everything is okay.

The rest is the same: we can train it in exactly the same way we trained the softmax classifier, and get the results by just doing the forward pass, computing the loss, calling backward, and doing the update. And you see, as the epochs go on, the loss is going down; as for the accuracy, now we get 98%, whereas previously it was around 90%, so using the CNN we get a much, much better result.

Now that we understand the CNN, we can connect more layers, right? Three layers, four layers; we can also use more layers for the fully connected part, so you can try a somewhat deeper network as an exercise. Also, for each convolution you can try different kernel sizes and see which one works better. In our next lecture, we are going to talk about much more exciting, advanced designs and architectures.