Transcript:
I am Shusen Wang. I am an assistant professor at Stevens Institute of Technology. In this lecture, I will give a short introduction to few-shot learning. Few-shot learning means making classification or regression predictions based on a very small number of samples.

Before getting started, let's play a game. I show you 4 images. Please look carefully. The left two images are armadillos. The right two are pangolins. You may never have heard of armadillos or pangolins, but it doesn't matter. Just pay attention to their differences and try to distinguish the two animals. If you don't know the difference, I can give you a hint: look at their ears and the size of their scales. Now, I give you a query image. Do you think it is an armadillo or a pangolin? Most people don't know the difference between armadillos and pangolins; they may never even have heard of them. But a human can learn to distinguish the two animals using merely 4 training samples. For a human, making a prediction based on 4 training samples is not hard. But can computers do this as well? If a class has only two samples, can a computer make the correct prediction? This is harder than the standard classification problem. The number of samples is too small for training a deep neural network.

Please keep these terms in mind: support set and query. The support set is a small set of labeled samples. It is too small for training a model. Few-shot learning is the problem of making predictions based on a limited number of samples.

Few-shot learning is different from standard supervised learning. The goal of few-shot learning is not to let the model recognize the images in the training set and then generalize to the test set. Instead, the goal is to learn to learn. "Learn to learn" sounds hard to understand. You can think of it this way. I train the model on a big training set. The goal of training is not for the model to know what an elephant is and what a tiger is, nor to be able to recognize unseen elephants and tigers. Instead, the goal is to know the similarity and difference between objects. After training, you can show two images to the model and ask whether they contain the same kind of animal. The model has learned the similarity and difference between objects, so it is able to tell whether the contents of the two images are the same kind of object.

Take a look at our training data again. The training data has 5 classes, which do not include the squirrel class. Thus, the model is unable to recognize squirrels. If you show an image of a squirrel to the model, the model does not know it is a squirrel. When the model sees two squirrel images, it does not know they are squirrels. However, the model knows they look alike, and it can tell you with high confidence that they are the same kind of object. For the same reason, the model has never seen a rabbit during training, so it does not know that two rabbit images show rabbits. But the model knows the similarity and difference between things; it knows that the contents of the two images are very alike, so it can tell that they are the same kind of object. Then I show two more images to the model. While the model has never seen a pangolin or a bulldog, it knows the two animals look quite different, so it believes they are different kinds of objects.

Now, I ask a different question. I have a query image. I show it to the model and ask what it is. The model is unable to answer my question; it has not seen this kind of object during training.
Then, I provide the model with additional information: I show 6 additional images to the model and tell it that they are a fox, a squirrel, a rabbit, a hamster, an otter, and a beaver. Now, the model can answer my question. The model compares the query image with each of the 6 images and finds the query most similar to the otter image. So the model believes the query is an otter.

"Support set" is meta-learning jargon: the small set of labeled images is called the support set. Note the difference between the training set and the support set. The training set is big; every class in the training set has many samples, so the training set is big enough for learning a deep neural network. In contrast, the support set is small; every class has at most a few samples. In this example, every class has only one sample. It is impossible to train a deep neural network using such a small set of data. The support set can only provide additional information at test time.

Here is the basic idea of few-shot learning. We train a big model using a big training set. Rather than training the model to recognize specific objects such as the tigers and elephants in the training set, we train the model to know the similarity and difference between objects. With the additional information provided by the support set, the model can tell that the query image is an otter, although otter is not among the classes in the training set.

I am going to explain what few-shot learning and meta-learning are. You may have heard of meta-learning. Few-shot learning is a kind of meta-learning. Meta-learning is different from traditional supervised learning. Traditional supervised learning asks the model to fit the training data and then generalize to unseen test data. In contrast, meta-learning's goal is to learn to learn.

How should we understand "learn to learn"? You bring your kid to the zoo. He is excited to see a fluffy animal in the water which he has never seen before. He asks you: Daddy, what's this? Although he has never seen this animal before, he is a smart kid and can learn by himself. Now, you give the kid a set of cards. On every card, there is an animal and its name. The kid has never seen the animal in the water. He has never seen the animals on the cards, either. But the kid is so smart that, by taking a look at all the cards, he knows the animal in the water is an otter: the animal in the water is most similar to the otter on the card.

Teaching the kid to learn by himself is called meta-learning. Before going to the zoo, the kid was already able to learn by himself. He knew the similarity and difference between animals. Although he had never seen an otter before, he could learn by himself: by reading the cards, he figured out that the animal is an otter. The kid wants to know the animal in the water, which he has never seen before. In meta-learning, this unknown animal is called a query. You give him a set of cards and let him learn by himself; the set of cards is the support set. What is meta-learning? Learning to learn is called meta-learning. In this example, teaching the kid to distinguish different animals by himself is meta-learning. Before going to the zoo, the kid had not heard of otters, but he knew how to relate the animal in the water to the otter on the card. In this example, the kid learns to recognize the otter using a set of cards. There is only one card for every species, so he learns to recognize the otter using only one card. This is called one-shot learning.

Here I compare traditional supervised learning with few-shot learning.
Traditional supervised learning works like this. First, learn a model using a big training set. After the model is trained, we can use the model for making predictions. We show a test sample to the model. The test sample has never been seen before; it is not in the training set. Fortunately, this test sample is from a known class. The test sample is a husky, and it belongs to the husky class, which has hundreds of samples. Although the model has never seen this particular husky, the model has seen hundreds of huskies, so it is not hard for the model to tell that this test sample is a husky.

Few-shot learning is a different problem. The query sample has never been seen before. Furthermore, the query sample is from an unknown class. The query sample is a rabbit, and rabbit is not among the classes in the training set; the model has never seen any rabbit during training. This is the main difference from traditional supervised learning. The training set does not have a rabbit class, so the model does not know what the query sample is. We need to provide the model with more information. We can show the cards to the model. Every card has an image and a name. The set of cards is the support set. By comparing the query with the cards, the model finds the query most similar to the rabbit card, so the model predicts that the query is a rabbit.

"Way" and "shot" are terms used in few-shot learning. k-way means the support set has k classes. In this example, the support set has 6 classes: fox, squirrel, rabbit, hamster, otter, and beaver, so k is 6. n-shot means every class has n samples. In this example, every class has only one sample, so n is 1. This support set is called 6-way, 1-shot. Take a look at another support set. It has 4 classes: squirrel, rabbit, hamster, and otter, so it is 4-way. There are two samples in every class, so it is 2-shot. This support set is called 4-way, 2-shot.

When performing few-shot learning, the prediction accuracy depends on the number of ways and the number of shots. In this figure, the x-axis is the number of ways, that is, the number of classes in the support set, and the y-axis is the prediction accuracy. As the number of ways increases, the prediction accuracy drops. Why does this happen? There is an otter in the zoo, and the kid does not know what it is. If I give the kid 3 cards and ask him to choose one out of the three, this is 3-way, 1-shot learning. If I give the kid 6 cards instead, this is 6-way, 1-shot learning. Which one do you think is easier, 3-way or 6-way? Obviously, 3-way is easier than 6-way: choosing one out of 3 is easier than choosing one out of 6. Thus 3-way has higher accuracy than 6-way. In the next figure, the x-axis is the number of shots, that is, the number of samples per class, and the y-axis is the prediction accuracy. As the number of shots increases, the prediction accuracy improves. This phenomenon is easy to interpret. The support set above is 2-shot, while the one below is 1-shot. With more samples, the prediction becomes easier, so 2-shot is easier than 1-shot.

The basic idea of few-shot learning is to train a function that predicts similarity. Denote the similarity function by sim(x, x'). It measures the similarity between two samples, x and x'. Here are 3 images: a bulldog, a bulldog, and a fox. Denote them by x1, x2, and x3. Ideally, taking x1 and x2 as input, the similarity function outputs one, which means the two animals are the same. Taking x1 and x3 as input, or taking x2 and x3 as input, the similarity function outputs zero, which means the two animals are different.

The idea can be implemented in this way. First, learn a similarity function from a large-scale training dataset. The similarity function tells us how similar two images are. In the next lecture, we will study the Siamese network, which can serve as the similarity function; the network can be trained using a large-scale dataset such as ImageNet. After training, the learned similarity function can be used for making predictions on unseen queries. We use the similarity function to compare the query with every sample in the support set and calculate the similarity scores. Then we find the sample with the highest similarity score and use its label as the prediction.

Let me use an example to demonstrate how to make a prediction. Given this query image, I want to know what it is. We compare the query with every sample in the support set. Comparing the query with the greyhound, the similarity function outputs a score of 0.2. The similarity score between the query and the bulldog is 0.1, and the similarity between the query and the armadillo is 0.03. Do the same for all the samples in the support set to get all the similarity scores. Among those scores, 0.7 is the biggest, so the model predicts the query is an otter. One-shot learning can be performed in this way: given a support set, we compute the similarity between the query and every sample in the support set to find the most similar sample.
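As a concrete illustration of this nearest-neighbor procedure, here is a minimal sketch in Python. The similarity function below is a toy stand-in based on pixel distance, and the image arrays are placeholders; in a real system, sim would be a trained model such as the Siamese network covered in the next lecture.

```python
import numpy as np

def sim(x, x_prime):
    """Toy similarity score in (0, 1]. A real system would use a trained
    network (e.g., a Siamese network), not raw pixel distance."""
    return float(np.exp(-np.mean((x - x_prime) ** 2)))

def predict(query, support_set):
    """Compare the query with every sample in the support set and
    return the label of the most similar one."""
    scores = {label: sim(query, image) for label, image in support_set.items()}
    return max(scores, key=scores.get)

# A 6-way 1-shot support set: 6 classes, one labeled image per class.
# Random arrays stand in for real 105-by-105 grayscale images.
rng = np.random.default_rng(0)
support_set = {name: rng.random((105, 105))
               for name in ["fox", "squirrel", "rabbit",
                            "hamster", "otter", "beaver"]}

query = support_set["otter"] + 0.01 * rng.random((105, 105))  # a noisy otter
print(predict(query, support_set))  # -> "otter"
```

The structure mirrors the lecture: prediction reduces to one similarity score per support sample, followed by taking the label with the largest score.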
If you do research on meta-learning, you will need datasets for evaluating your model. Here I introduce the two datasets most widely used in research papers.

Omniglot is the most frequently used dataset. The dataset is small, only a few megabytes. Omniglot is a handwritten-character dataset similar to MNIST. The MNIST dataset is for digit recognition: MNIST has 10 classes, and each class has 6 thousand samples. In contrast, Omniglot has over 1 thousand classes, but each class has only 20 samples. This makes classification on Omniglot harder than on MNIST. You can download the dataset using the link or import it using TensorFlow (a loading sketch is given at the end of this transcript). The dataset has 50 alphabets, such as Hebrew, Greek, and Latin. Every alphabet has many characters; for example, Greek has 24 letters, from alpha, beta, and gamma all the way to omega. For every character, there are 20 samples written by different people.

Here is a summary of Omniglot. It has 50 alphabets, including various scripts like Latin, Greek, and Hebrew. Every alphabet has multiple characters; for example, Greek has 24 letters. The 50 alphabets have a total of 1,623 unique characters, so the dataset has 1,623 classes. Each character was written by 20 different people, which means each class has 20 samples. All the samples are 105-by-105 images. The training set has 30 alphabets, which contain 964 characters and thus 964 classes; it contains a total of 19,280 samples. The test set has 20 alphabets, which contain 659 characters and thus 659 classes; it has a total of 13,180 samples.

Another commonly used dataset is Mini-ImageNet. It has 100 classes, such as mushroom, orange, corn, bird, and snake. Every class has 600 samples, so the dataset has a total of 60 thousand samples.

We have learned the basic concepts of few-shot learning and meta-learning. In the next class, we will study the Siamese network for few-shot learning.
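As mentioned above, Omniglot can be imported through TensorFlow. Here is a minimal loading sketch using the tensorflow_datasets package; the split names and feature shapes follow the TFDS catalog, but treat the details as assumptions to verify against the current documentation.

```python
import tensorflow_datasets as tfds

# Load Omniglot from TensorFlow Datasets. In TFDS, the "train" and
# "test" splits correspond to the 964-class and 659-class partitions
# described above; each example pairs an image with a character label.
train_ds, test_ds = tfds.load("omniglot", split=["train", "test"],
                              as_supervised=True)

for image, label in train_ds.take(1):
    # Images are 105-by-105 (RGB in the TFDS version); labels are
    # integer ids over the 1,623 character classes.
    print(image.shape, label.numpy())
```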