Transcript:

Hello, world, it’s Siraj. What hyperparameters should you use to train your models? You will see these magic numbers a lot. They are the model values that are set before you train on any data set. A machine learning model is just a formula with a number of parameters that need to be learned from data. But there are also parameters that can’t be directly learned from the regular training process. We call these higher-level properties hyperparameters. This could be the number of trees in a random forest, the number of hidden layers in a neural network, or the learning rate for logistic regression. Choosing them is a process of trial and error, and it is not very intuitive, since we are not great at interpreting high-dimensional data. Researchers consider the possibility space of hyperparameters their canvas. But what if we could have these parameters learn the optimal values for themselves? That would make life easier, right? Let’s see if we can figure out a really basic strategy for ourselves and then try to improve it.
I have got a data set of tweets that are labeled as positive or negative, perfect for a binary classification problem. Let’s say I build a support vector machine to learn this mapping, so it can then classify a new tweet immediately. This is called sentiment analysis. It is a really popular task in language processing. If we mapped out these tweets as vectors in 2D space, we can imagine a curvy line that separates the positive tweets from the negative ones. This is a decision boundary. Separate but equal. A support vector machine can help us define this decision boundary. Since it is non-linear, our SVM will use what is known as the kernel trick. That means instead of trying to fit a non-linear model, we can map the data from the input space to a new higher-dimensional space called the feature space by doing a non-linear transformation using a kernel, or similarity function, and then use a linear model in the feature space.
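
To make "kernel, or similarity function" concrete, here is a minimal sketch of one common kernel, the radial basis function, written as exp(-gamma * ||x - y||^2). The tweet vectors and the gamma value are made up for illustration; they are not from the video.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Similarity between two equal-length vectors: exp(-gamma * ||x - y||^2)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Hypothetical 2-D tweet vectors, purely for illustration.
tweet_a = [1.0, 2.0]
tweet_b = [1.0, 2.5]
tweet_c = [4.0, -1.0]

sim_close = rbf_kernel(tweet_a, tweet_b)  # nearby vectors: similarity close to 1
sim_far = rbf_kernel(tweet_a, tweet_c)    # distant vectors: similarity close to 0
```

A vector compared with itself gives exactly 1, and the similarity decays toward 0 as the two vectors move apart. A larger gamma makes that decay faster.
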
We define our kernel, or similarity function, between tweet vectors as the radial basis function, which takes two vectors as input and outputs a similarity based on the distance between them: K(x, x’) = exp(-gamma * ||x - x’||^2). So the more similar two tweets are, the higher the output value from our function. There are two hyperparameters that govern how our line is going to be drawn: C and gamma. Both of these hyperparameters need to be selected very carefully. They depend on each other in unknown ways, so we cannot just optimize one parameter at a time and then combine the results.
What if we just tried every single combination of hyperparameters? Assuming we have built our SVM already, we can choose a set of possible values for both of them and create a variable to store our model’s accuracy for each set. Then we will create a nested for loop: for every value of C, try every value of gamma. Inside our loop, we will initialize our SVM with the hyperparameters at that iteration, then train it and score it. Then we compare its score to our best score. If it is better, we will update our values accordingly. This process will run for every hyperparameter value we have until it finds the optimal ones. This technique is called grid search. We essentially made a grid of our search space and then evaluated each hyperparameter setting at the points we introduced, for as many dimensions as necessary. This was a pretty easy strategy to implement, but it scales pretty poorly with the more hyperparameters, or dimensions, we add. This is also known as the curse of dimensionality.
I think we can do better than an exhaustive search. We tried every combination of a preset list of values for our hyperparameters. But what if instead we tried random combinations from a range of values, for a number of iterations we define? Unlike grid search, this won’t guarantee that we get the best combination of the values on our grid, but it will take a lot less time. So manual search, grid search, and random search are fine and dandy.
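
The two search strategies described above can be sketched as follows. This is a minimal illustration, not the code from the video: `toy_score` is a hypothetical stand-in for "train the SVM and return its accuracy", the candidate values are invented, and sampling on a log scale in the random search is an assumption made here.

```python
import math
import random

def toy_score(C, gamma):
    """Hypothetical stand-in for 'train the SVM and return its accuracy'.

    It peaks at C=10, gamma=0.1 so the searches have something to find.
    """
    return math.exp(-(math.log10(C) - 1) ** 2 - (math.log10(gamma) + 1) ** 2)

def grid_search(score_fn, C_values, gamma_values):
    """Try every (C, gamma) pair with a nested loop and keep the best one."""
    best_score, best_params = float("-inf"), None
    for C in C_values:
        for gamma in gamma_values:
            score = score_fn(C, gamma)
            if score > best_score:
                best_score, best_params = score, {"C": C, "gamma": gamma}
    return best_params, best_score

def random_search(score_fn, C_range, gamma_range, n_iter, seed=0):
    """Score n_iter random (C, gamma) pairs drawn log-uniformly from each range."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        C = 10 ** rng.uniform(math.log10(C_range[0]), math.log10(C_range[1]))
        gamma = 10 ** rng.uniform(math.log10(gamma_range[0]), math.log10(gamma_range[1]))
        score = score_fn(C, gamma)
        if score > best_score:
            best_score, best_params = score, {"C": C, "gamma": gamma}
    return best_params, best_score

C_values = [0.01, 0.1, 1, 10, 100]
gamma_values = [0.001, 0.01, 0.1, 1, 10]

# Grid search costs len(C_values) * len(gamma_values) = 25 evaluations.
grid_params, grid_score = grid_search(toy_score, C_values, gamma_values)

# Random search costs exactly n_iter evaluations, whatever the number of dimensions.
rand_params, rand_score = random_search(toy_score, (0.01, 100), (0.001, 10), n_iter=10)
```

The cost difference is the point: each extra hyperparameter multiplies the number of grid evaluations, while the random search budget stays at whatever `n_iter` we choose.
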
But there has got to be a more intelligent way of doing this that incorporates learning. One technique that is very popular right now is called Bayesian optimization. Last episode, we talked about how Bayes’ theorem is a way to determine conditional probabilities. It shows us how to update an existing prediction given new evidence. This forms the basis of the Bayesian way of thinking, as opposed to the frequentist approach. These are the two different approaches to probability. Basically, it’s like a mathematical gang war between applied statisticians. Bayesian means probabilistic. It focuses on the probability of the hypothesis given the data. That means the data is fixed and the hypothesis is random. The frequentist approach focuses on the probability of the data given the hypothesis. So the data is random, as in, if we repeat the study, the data might come out differently, but the hypothesis is fixed. We can apply frequentist or Bayesian methods to pretty much any learning algorithm. They have different aims.
In the context of hyperparameter optimization, a Bayesian approach takes advantage of the information our model learns during the optimization process. The idea is that we pick some prior belief about how our hyperparameters will behave and then search the parameter space by enforcing and updating our prior belief based on our ongoing measurements. So the tradeoff between exploration (making sure we have visited the relevant corners of our space) and exploitation (once we have found a promising region of our space, finding the optimal value in it) is handled in a more intelligent way. You know, we only have a few weeks left to submit our... Bayesian optimization uses previously evaluated points to compute a posterior expectation of what the loss f looks like. Then it samples the loss at a new point that maximizes some utility of the expectation of f. That utility tells us which regions of the domain of f are best to sample from. This two-step process is repeated until convergence.
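
Written out, the two-step loop looks roughly like this. The symbols are notation chosen here for illustration, not taken from the video: D_{1:t} is the set of points evaluated so far, and u is the utility function.

```latex
\begin{aligned}
\mathcal{D}_{1:t} &= \{(x_1, f(x_1)), \dots, (x_t, f(x_t))\} \\
x_{t+1} &= \arg\max_{x}\; u(x \mid \mathcal{D}_{1:t}) \\
\mathcal{D}_{1:t+1} &= \mathcal{D}_{1:t} \cup \{(x_{t+1}, f(x_{t+1}))\}
\end{aligned}
```

The second line is the "pick the most useful point" step, and the third line is the "evaluate the loss there and remember it" step.
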
For the prior distribution, we assume that f can be described by a Gaussian process. A Gaussian distribution, often called a normal distribution, is described as a bell-shaped curve. Distributions are equations that link the outcomes of a statistical experiment with their probability of occurrence. The Gaussian is quite popular. Half of the data falls to the left of the mean, half falls to the right, and this is useful in many situations. A Gaussian process is a generalization of the Gaussian distribution over functions instead of random variables. While Gaussian distributions are specified by their mean and variance, Gaussian processes are specified by their mean function and covariance function.
The way we find the best point to sample f at next is to pick the point that maximizes an acquisition function. This is a function of the posterior distribution over f that describes the utility for all values of the hyperparameters. The values that have the highest utility will be the values you compute the loss for next. We’ll use the popular expected improvement function, where x is the current optimal set of hyperparameters. By maximizing this, it will give us the point that improves on f the most. So given the observed values f(x), we update the posterior expectation of f using the GP model. Then we find the new x that maximizes the acquisition function, the expected improvement. And finally we compute the value of f for the new x. Initially, the algorithm will explore the parameter space. But it quickly discovers the region with the best performance and samples points in that region.
To summarize, we can optimize our hyperparameters using several strategies, but Bayesian optimization looks most promising. Bayesian optimization picks a prior belief about how the hyperparameters will behave and then searches the parameter space by enforcing and updating that prior belief based on ongoing measurements. So Bayesians let their prior beliefs influence their predictions.
Frequentists don’t!
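
The Bayesian optimization loop described above (update the GP posterior, maximize expected improvement, evaluate f at the new point) can be sketched like this. It is a toy illustration under several assumptions made here, not the code from the video: a single hyperparameter, a made-up quadratic `toy_loss`, a fixed kernel length scale, unit prior variance, a fixed list of candidate points, and NumPy for the linear algebra.

```python
import math
import numpy as np

def rbf(a, b, length_scale=0.3):
    """Squared-exponential covariance between two 1-D arrays of points."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a GP at the points x_new."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_new)
    y_mean = y_obs.mean()  # constant mean function, estimated from the data
    mean = y_mean + K_s.T @ np.linalg.solve(K, y_obs - y_mean)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mean, std, best_loss):
    """EI for minimization: (best - mean) * Phi(z) + std * phi(z)."""
    z = (best_loss - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best_loss - mean) * cdf + std * pdf

def bayes_opt(loss, candidates, n_init=3, n_iter=10, seed=0):
    """Minimize loss over candidates with a GP model and expected improvement."""
    rng = np.random.default_rng(seed)
    x_obs = rng.choice(candidates, size=n_init, replace=False)
    y_obs = np.array([loss(x) for x in x_obs])
    for _ in range(n_iter):
        # Step 1: update the posterior expectation of f from the observed points.
        mean, std = gp_posterior(x_obs, y_obs, candidates)
        # Step 2: evaluate f at the candidate that maximizes the acquisition function.
        ei = expected_improvement(mean, std, y_obs.min())
        x_next = candidates[np.argmax(ei)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, loss(x_next))
    best = np.argmin(y_obs)
    return x_obs[best], y_obs[best]

def toy_loss(x):
    """Hypothetical loss with its minimum at x = 0.7."""
    return (x - 0.7) ** 2

candidates = np.linspace(0.0, 1.0, 101)
best_x, best_loss = bayes_opt(toy_loss, candidates)
```

In a real setting, `toy_loss` would be replaced by something expensive, such as training the SVM and returning its validation error, which is exactly why spending a little computation on choosing the next point is worth it.
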