Transcript:

[MUSIC] This video will explain the Hyperband algorithm for AutoML. AutoML refers to the general practice of hyperparameter optimization in machine learning, and in this case the algorithm shows a way to speed up the evaluation of different hyperparameter configurations. Hyperparameter optimization can be defined over a discrete search space of the units that make up deep neural network architectures. In this example, there could be six different categories of a neural cell, five different learning rates, and seven different regularization settings; altogether that would make 210 configurations. The problem with evaluating all configurations is that training deep neural networks all the way to convergence typically takes a long time, so the idea of Hyperband is to speed up the evaluation of each configuration. This can be done along three dimensions. You could do early stopping, where you train for maybe a fixed number of epochs, or you notice the loss isn't changing much after each step and just terminate training. You could train on a subset of the data: for example, out of the 50,000 training images for CIFAR-10, you can imagine training on just a subset of maybe even just a hundred images per class. That is probably a little extreme, but something like that would speed up training because you get through each epoch much faster. The other idea doesn't really translate to images as much, although you could think about using maybe a grayscale image, or quantizing each pixel value more coarsely than 0 to 255; but if you have a tabular dataset, you might train on a subset of the features. One other cool thing is to construct heat maps of the explored hyperparameters, where the red regions correspond to high classification performance and the blue regions correspond to really poor classification performance. 
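The "loss isn't changing much" flavor of early stopping mentioned above can be sketched as a simple check. This is an illustrative assumption, not code from the video; the function name and the `patience` and `min_delta` parameters are hypothetical:

```python
def should_stop_early(losses, patience=5, min_delta=1e-3):
    """Stop if the best loss hasn't improved by at least `min_delta`
    over the last `patience` epochs (the "loss isn't changing much" rule)."""
    if len(losses) <= patience:
        return False  # not enough history yet to judge a plateau
    best_before = min(losses[:-patience])  # best loss seen before the window
    best_recent = min(losses[-patience:])  # best loss inside the window
    return best_before - best_recent < min_delta
```

A training loop would call this after each epoch and terminate the run early when it returns True, freeing budget for other configurations.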
So these are some of the search algorithms used to search these discrete search spaces that make up deep neural networks: random search, grid search, Bayesian optimization, evolution, reinforcement learning, and differentiable search. In this Hyperband paper, they are going to use random search, but they speed it up through this Hyperband strategy of resource allocation. Early stopping is the mechanism used here to save computation, but sometimes early stopping can fail, as shown in this picture. Early on, at say 30 epochs, it's unclear which function is going to perform better. The v2 function is actually a little ahead of v1 at that moment, so if you stop early there, you would conclude that v2 is better than v1. But if you train all the way to convergence, you eventually find out that v1 is better than v2, so early stopping can be problematic sometimes. So you would think about this metadata on the behavior of the convergence: you can imagine a behavior set where a hyperparameter configuration either fails epically immediately, with a dramatically terrible training loss that just stays that way for the first 20 epochs, or is slow to improve and then improves very quickly. In the previous case, the v1 function was slow to improve. One way of adapting to these different kinds of behavior sets that different configurations can have would be successive halving, where you uniformly allocate 
a hyperparameter configuration budget, evaluate the configurations, throw away the worst half, and then continue that tournament-style optimization until you have just one hyperparameter configuration remaining. The issues with successive halving are: how many configurations should you consider, and this uniform allocation doesn't really explore the different behavior sets that configurations can have in their convergence. So Hyperband says: randomly distribute the resources rather than uniformly distributing them, and this allows you to explore the different convergence behaviors that could be inherent in each hyperparameter configuration. So this is the entire Hyperband algorithm, and it's a little confusing to plug in numbers for R and eta and then go through calculating all the values of the different parameters, but the high-level idea is that there's an outer loop that defines the number of configurations to throw away at each iteration and the number of resources to allocate, like the maximum number of resources, as you distribute them stochastically. So here's a simpler idea than the algorithm to demonstrate it. Imagine you have a budget B of 500 epochs and you have 16 configurations; then you need 4 rounds if you're keeping the top half at each iteration. So you imagine chunking B into four bins, giving 125 epochs of resources for each evaluation round. Then you would randomly distribute those 125 epochs amongst the configurations in each tournament round, compared to just distributing 125/n to each configuration. You could also imagine repeating this, like k times, to have real assurance of the correct distribution, which might let you get away with using fewer epochs if you repeat it over and over again. 
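The B = 500 epochs, 16 configurations example above can be sketched as a successive-halving loop. This is a minimal sketch, not the paper's implementation; `eval_fn` is a hypothetical callable standing in for "train this configuration for this many epochs and return a validation score":

```python
import math

def successive_halving(configs, total_budget, eval_fn, keep_frac=0.5):
    """Split `total_budget` evenly across tournament rounds, evaluate every
    surviving configuration, and keep only the top `keep_frac` each round
    until one configuration remains.

    eval_fn(config, epochs) -> validation score (higher is better);
    assumed to train `config` for `epochs` epochs.
    """
    n_rounds = round(math.log(len(configs), 1 / keep_frac))  # 16 configs -> 4 rounds
    budget_per_round = total_budget // n_rounds               # 500 -> 125 epochs per round
    survivors = list(configs)
    for _ in range(n_rounds):
        epochs_each = budget_per_round // len(survivors)      # uniform split within the round
        scored = sorted(survivors, key=lambda c: eval_fn(c, epochs_each), reverse=True)
        survivors = scored[:max(1, int(len(scored) * keep_frac))]
    return survivors[0]
```

With 16 configurations and a 500-epoch budget this runs 4 tournament rounds of 125 epochs each; this is the uniform-allocation baseline that the video contrasts with Hyperband's stochastic allocation of each round's budget.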
So this is an example of a resource allocation, where n is the number of configurations and r would be the resources for each one in that bracket. In the first round, 81 configurations would each get 1 resource, and that continues all the way down to 1 configuration with 81 resources. It's sort of a weird idea, but this is the idea of Hyperband. These are some of the discrete search spaces they test in the Hyperband paper. With the MNIST LeNet example, they test the learning rate on a log scale, where it can go from 1e-3 to 1e-1. The log scale just means it doesn't step linearly from 0.001 to 0.002 to 0.003; it goes 0.001, 0.01, 0.1. They also test the batch size and the number of hidden units in the two layers. Similarly, this is what the AlexNet search space would look like: they have the initial learning rate, the regularization of each of the layers, and then the schedule for reducing the learning rate. Another thing they talk about frequently is that this is a non-stochastic bandit algorithm, and I found this to be a weird kind of idea. Is classification performance really non-stochastic? I don't think it is, because you have the initialization, the curriculum (the order in which the training set is presented to the model), and the data augmentation parameters, which are largely stochastic as well. So I do think that it is a stochastic bandit setting, meaning that, because initializations can really dramatically change how a model performs, even with the same configuration you're not going to get the same result multiple times. So I think it's really important to repeat the Hyperband runs. 
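The n/r allocation above (81 configurations at 1 resource each, down to 1 configuration at 81 resources) follows a geometric schedule. A minimal sketch of one such bracket, assuming the keep-rate eta = 3 implied by the 81 → 27 step; the function name is illustrative:

```python
def bracket_schedule(n=81, r=1, eta=3):
    """One successive-halving bracket: start with n configs at r resources
    each, then keep roughly the top 1/eta and multiply each survivor's
    resources by eta, until one configuration gets the full budget."""
    rounds = [(n, r)]
    while n > 1:
        n, r = n // eta, r * eta  # fewer survivors, more epochs each
        rounds.append((n, r))
    return rounds
```

Each bracket trades off between many cheaply evaluated configurations and a few thoroughly trained ones; Hyperband runs several brackets with different starting points to cover the different convergence behaviors.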
So thanks for watching this video on Hyperband. Hyperband is an idea to speed up neural architecture search through strategic resource allocation, such that you can explore the convergence behaviors quickly and get a sense of which configuration is going to be worth allocating more resources to. If you liked this video, I recommend another video made by Henry AI Labs on neural architecture search. It will show you how you can design a discrete search space for things like neural cells and the layers. So thanks for watching, and please subscribe to Henry AI Labs for more deep learning videos.