One of the most exciting recent developments in deep learning is the rise of end-to-end deep learning. So what is end-to-end deep learning? Briefly, some data processing or learning systems require multiple stages of processing. End-to-end deep learning takes those multiple stages and replaces them with a single neural network.

Let’s look at an example. Take speech recognition, where the input X is an audio clip and the output y is the transcript of that clip. Traditionally, speech recognition required many stages of processing. First, you extract features from the audio. If you have heard of MFCC, that is an algorithm for extracting features from audio. After extracting these low-level features, you apply a machine learning algorithm to find the phonemes in the audio clip. Phonemes are the basic units of sound. For example, the word "cat" is made up of three sounds. You then string the phonemes together to form individual words, and string the words together to form the transcript of the audio clip. In contrast to a pipeline with many stages like this, end-to-end deep learning trains a single network that takes the audio clip as input and directly outputs the transcript.

One interesting sociological effect in AI is that, as end-to-end deep learning started to work better, there were researchers who had spent many years designing one individual stage of these pipelines. This was true not only in speech recognition but also in computer vision and many other fields. They had spent a lot of time, published a lot of papers, and in some cases built much of their careers on engineering pieces of the pipeline. Then end-to-end deep learning took a large training set and learned the mapping from X to y directly, skipping those intermediate steps, and that was painful for many people in those fields.
This new way of building AI systems made much of that intermediate work obsolete almost overnight.

One of the challenges of end-to-end deep learning is that you may need a lot of data before it works well. For example, if you train on 3,000 hours of data to build a speech recognition system, the traditional full pipeline works really well. It is only when you have a very large data set, say 10,000 hours or even 100,000 hours of data, that the end-to-end approach starts to work much better. So with a smaller data set, the more traditional pipeline works well, and you need a large data set before the end-to-end approach really shines. If you have a medium amount of data, there are also intermediate approaches: you skip the feature extraction, have a neural network learn to output the phonemes, and then continue with the other stages. That is a step toward end-to-end learning, but it does not go all the way.

This is a picture of a face recognition turnstile developed by Yuanqing Lin at Baidu. The camera here looks at the person approaching the gate, and if it recognizes the person, the turnstile automatically opens and lets them through. So instead of swiping an RFID badge to get in and out, you just walk up to the turnstile and it recognizes your face. It is used by many companies in China.

So how do you build a system like this? One thing you could do is look at the image the camera captures. My drawing is not very good, but let’s say this is the camera image, with someone approaching the turnstile. This would be the input image X. One thing you could try is to learn a function that maps directly from the image X to the person’s identity y. It turns out this is not the best approach. One problem is that a person can approach the turnstile from many different directions. They could be in this green position here, or in this blue position. Sometimes they are closer to the camera, so they appear bigger in the image, and sometimes they are already so close that their face appears much larger. So, when building this turnstile, consider the image exactly as the camera captures it.
Instead of feeding that raw image straight into a neural network to work out the person’s identity, the best approach today is a multi-step one. First, you run one piece of software to detect the person’s face, so this first detector works out where the face is. Having detected the face, you zoom in on that part of the image and crop it so that the face is centered. This crop, the one I have drawn in red, is what goes into a neural network that then tries to learn, or estimate, the person’s identity. Researchers have found that, rather than trying to learn everything in one step, it is better to break the problem into two simpler steps: first, figure out where the face is; second, look at the face and figure out who it is. This way, two learning algorithms each handle a much simpler task, and the overall performance is better.

By the way, if you want to know how the second step actually works, I have kept the explanation simple. The way the second step is trained is that the network takes two images as input and tells you whether they show the same person or not. So if you have the IDs of 10,000 employees on file, you can take the red image and quickly compare it against all 10,000 to see whether it matches one of those employees, and so decide whether that person should be allowed into the facility or building. This is a turnstile that gives employees access to a workplace.

Why does the two-step approach work better? There are two reasons. First, each of the two problems is much simpler. Second, you have a lot of data for each of the two sub-tasks. In particular, there is a lot of data for the first task, face detection: looking at an image and figuring out where the face is. There is plenty of labeled data of the form (x, y), where x is a picture and y is the position of the face, so you can build a neural network that does the first task well. Separately, there is also a lot of data for the second task. Leading companies today have, let’s say, hundreds of thousands of face photos.
These are closely cropped images, like the red one here or the ones below it. Leading face recognition teams have at least millions of such images, which they can use to train a network that looks at two images and works out the person’s identity, or whether the two show the same person. So there is a lot of data for the second task too. In contrast, if you tried to learn everything at once, you would have far less data of the form (x, y), where x is an image taken at the turnstile and y is the person’s identity. Because there is not enough data to solve the end-to-end learning problem, but there is enough to solve the two sub-problems, breaking the problem into two steps gives better performance in practice than a pure end-to-end approach. With enough data for the end-to-end approach, it might well do better, but that is not what works best in practice today.

Let’s look at another example. Take machine translation. Traditionally, machine translation systems also had a long, complicated pipeline: you take the English text, analyze it, extract a number of features from the text, and so on, and after many steps you end up with a French translation. Because there are a lot of (English, French) sentence pairs available for machine translation, end-to-end deep learning works quite well here. These days it is feasible to gather large data sets of (x, y) pairs, in this case English sentences and their French translations, so end-to-end deep learning works well in this example.

As a last example, say you want to look at an X-ray image of a child’s hand and estimate the child’s age. When I first heard about this problem, I thought it was a crime scene investigation task: sadly, you find a child’s skeleton and try to work out how old the child was. It turns out that estimating a child’s age from a hand X-ray is usually less dramatic than a crime scene investigation. Pediatricians use it to determine whether or not a child is growing normally. In a non-end-to-end approach, you first take the image.
Then you segment out, or recognize, the individual bones, working out where each bone segment is. Once you have found the different bones and know their lengths, you can look up the average bone lengths for children of each age and use that to estimate the child’s age. This approach actually works quite well. In contrast, if you tried to go straight from the image to the child’s age, you would need a lot of data to do that directly. As far as I know, that approach does not work as well today, because there is not enough data to learn the task end to end. If you break the task into two steps, though, the first step is a relatively simple problem that may not need much data: you do not need many X-ray images to learn to find the bones. The second step, collecting statistics on children’s hand sizes, can also be done with a fairly modest amount of data. So this multi-step approach looks promising, maybe more promising than the end-to-end approach, at least until you have enough data for end-to-end learning.

So end-to-end deep learning can work really well, and it can simplify the system a lot, since you do not have to spend so much effort building the intermediate components. But it is not a panacea. In the next video, I want to give a more systematic description of when to use end-to-end deep learning, and how to piece together these machine learning systems.
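
As a supplementary note on the hand X-ray example, the two-step approach can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not the method used by any real system: the function names (`segment_bones`, `estimate_age`, `bone_age_pipeline`) are hypothetical, the first step is a stub standing in for a learned bone-segmentation model, and the lookup table contains made-up placeholder numbers rather than medical data.

```python
from typing import Dict, List

# HYPOTHETICAL lookup table: age in years -> average bone length in mm.
# The numbers are placeholders for illustration only, NOT medical data.
AVERAGE_BONE_LENGTH_MM: Dict[int, float] = {
    2: 20.0,
    4: 25.0,
    6: 30.0,
    8: 35.0,
    10: 40.0,
}


def segment_bones(xray_image: dict) -> List[float]:
    """Step 1 (stub): find the bones and return their lengths in mm.

    A real system would run a learned segmentation model on pixel data.
    Here we assume the "image" is a dict that already carries the
    measurements, so the sketch stays self-contained.
    """
    return list(xray_image["bone_lengths_mm"])


def estimate_age(bone_lengths_mm: List[float]) -> int:
    """Step 2: return the age whose average bone length is closest
    to the mean of the measured lengths."""
    if not bone_lengths_mm:
        raise ValueError("no bones were found in the image")
    mean_length = sum(bone_lengths_mm) / len(bone_lengths_mm)
    return min(
        AVERAGE_BONE_LENGTH_MM,
        key=lambda age: abs(AVERAGE_BONE_LENGTH_MM[age] - mean_length),
    )


def bone_age_pipeline(xray_image: dict) -> int:
    """Two-step pipeline: image -> bone lengths -> estimated age."""
    return estimate_age(segment_bones(xray_image))


if __name__ == "__main__":
    fake_xray = {"bone_lengths_mm": [29.0, 31.0, 30.5]}
    print(bone_age_pipeline(fake_xray))
```

The point of the split is the one made in the lecture: each step can be built and supplied with data separately. An end-to-end alternative would replace both functions with a single model trained on (X-ray image, age) pairs, which needs far more labeled data.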