Transcript:

All right, hi, everyone! Thank you for coming early this morning to our talk. Today I’ll be presenting on two papers that made a concurrent discovery of the same idea, the first paper being “Categorical Reparameterization with Gumbel-Softmax,” and the second paper being “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables,” by Chris Maddison, Andriy Mnih, and Yee Whye Teh at DeepMind. Our paper was in collaboration with Shane Gu and Ben Poole at Google Brain. I’ll be using the terms Concrete and Gumbel-Softmax interchangeably for the remainder of the slides, except when I’m referring to a specific experiment from one of our papers.

So, the venerable backpropagation algorithm relies on having chains of continuous functions in each of the layers of your neural network. However, a lot of the architectures that we’re interested in exploring these days in deep learning research are fundamentally using discrete operations. For instance, in language models, we have sequences of word or character tokens that are being sampled, where each discrete token corresponds to a word or a character. The popular LSTM recurrent neural network architecture has internal gating mechanisms that are used to forget and learn long-term dependencies, and although in practice these gating units are continuous, we still think about them as being able to switch on and off, so there’s some kind of discrete way of thinking here. A more complicated example in recent papers is Alex Graves’s Neural Turing Machine architecture, or the DNC architecture, where you have discrete indexing operations into some kind of memory. What these discrete operations are doing is acting as hard addressing mechanisms, and this allows us to express rich computation. And this list goes on and on: there are recurrent models of visual attention,
hard attention models, differentiable data structures, and we also think a lot about discrete stuff in unsupervised learning, attention, and quantized compression.

So now that the motivation is here, I want to talk a little bit about stochastic gradient estimation. Let’s say you want to compute the gradient of some function of the samples with respect to some parameter of a distribution. The act of sampling breaks the dependency from the parameters to the sample, so it’s difficult to backpropagate through stochastic nodes. The most generic method is to use a score function estimator, or REINFORCE, but I’ll review a technique that’s become very popular recently, which is known as the reparameterization trick. If you can re-express the sample that you’re drawing as a deterministic function of your parameters and some independent noise, then you can actually backpropagate through your sample with respect to the parameters. The most common example of this is Gaussians, which are used very much in variational autoencoder models.

What if the stochastic node is discrete? We’re interested in using stochastic discrete nodes because we can use them to train sigmoid belief nets and Helmholtz machines. So the motivation here is that we want to derive a very similar reparameterization trick for discrete random variables. For deterministic discrete step functions, it’s common practice to approximate them with a continuous relaxation using sigmoids and softmaxes, and in fact, in physics simulators, hard contact events are often smoothed out using some kind of continuous relaxation. So we’re going to combine the ideas of reparameterization and smooth relaxation.

A somewhat obscure way to sample from the categorical distribution is known as the Gumbel-Max trick, where you take the class probabilities
alpha 1, alpha 2, alpha 3, you take their logs, and to each of these logits you add Gumbel noise, which you can just sample by taking two logs of some uniforms. And if you take the biggest one of these perturbed logits and set that to 1 and the rest of them to 0, you’ve sampled from the categorical distribution. However, you can’t backpropagate through the argmax, because the gradients that you get out of it are 0, so they’re not very helpful. What we propose in each of our papers is to simply replace the argmax with a softmax. In fact, this is basically the meat of the paper: we use the softmax as a continuous approximation of the argmax. So the output of this sampling mechanism is actually a continuous, soft variable rather than a quantized representation. The parameter lambda is the softmax temperature, which corresponds to how much the winner-take-all dynamics happen when you’re applying the softmax. So if lambda is very small, then you actually get very close to a quantized categorical sample, and this is nice, because now we can just backpropagate through to our logits and get reparameterization gradients.

The way that we use this architecture for training stochastic neural networks is to basically drop it into the neural network of interest. We train the continuous version, and then we evaluate with the discrete one. In the Gumbel-Softmax paper, we have an additional variant where you can actually just pass the quantized version in the forward pass, but in the backward pass, you take the gradients with respect to the soft samples, and this is very similar to the straight-through estimator proposed by Bengio in 2013.

Okay, experiments. I’ll be presenting a selection of results from both papers; for the full set of experiments, please read our papers. The goal of structured output prediction here is, given the top half of an image,
we want to predict the bottom half, and the optimization objective here is to maximize the log likelihood of the image. This result is taken from the Concrete distribution paper. They compare it to a state-of-the-art score function estimator called VIMCO, and we see that the Concrete / Gumbel-Softmax estimator performs better on the quantized test set than VIMCO.

All right, next set of experiments. I’m sure the Bayesian deep learning crowd here is pretty familiar with autoencoders, but I’ll just quickly review the slide. The idea of variational autoencoders is to jointly train a generative model and an inference model. It’s very much like the structured output prediction task earlier, but in addition to the log probability reconstruction error term, we also penalize the network for deviating from some prior about the distribution. Now, here are the approaches that our papers took to training these VAE models. In the Concrete paper, they use a continuous relaxed density for the VAE objective, so the ELBO that they’re computing is with respect to the relaxed distribution. In the Gumbel-Softmax paper, what we do is use the categorical prior on our soft samples, so this makes it no longer a valid lower bound, but in practice we find it works pretty well. In both approaches, please note that these are still biased estimators for the discrete variational objective.

Here are some results. In the case of a single linear layer, VIMCO actually outperforms the Concrete / Gumbel-Softmax estimator. But when you have more nonlinearities between the stochastic nodes, we observed that the Concrete distribution estimator works better. In the Gumbel-Softmax paper, we also compared it to some other stochastic gradient estimators, and we find it does very well. It also seems to train very quickly, which is a nice property to have.

In the last set of experiments,
we applied Gumbel-Softmax to semi-supervised classification. The goal of semi-supervised classification is to learn good image classification using some labeled data and a lot of unlabeled data. Durk Kingma’s paper in 2014 takes this approach, where they learn an embedding of images in an unsupervised manner and use that to inform their classification and achieve good generalization.

So how does it work? It’s basically the same thing as a variational autoencoder, except they also add an additional categorical latent state. So now the latent state is like a joint between a Gaussian and a categorical corresponding to the class. When you have labeled data, you train this just like a regular Gaussian VAE, and so the ELBO of the reconstruction, x tilde up there, can backpropagate through z, since z is a Gaussian and is a reparameterizable distribution. For the discrete case, the original paper handles this by marginalizing out every instance of the class. So on the right-hand side, you have y unobserved, and what they do is, for a ten-class classification problem, they make ten copies of the one on the left, and so you can still backpropagate through z. However, this is problematic, because then the amount of computation time to do training scales with the number of classes. In the original paper they weren’t able to backprop through the categorical, but now we can. So in our experiments, we did joint inference in a single pass through both the categorical and the Gaussian, and the result is that we can achieve similar accuracy on MNIST at a two-times speedup, and as the number of classes increases, the wins get even bigger.

Okay, some takeaways for using Gumbel-Softmax / Concrete. We provide a very nice, low-variance estimator for computing gradients of a relaxed distribution for discrete variables. It’s very easy to use. It’s very much a plug-and-play architecture.
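
The sampling procedure described earlier in the talk (add Gumbel noise to the log class probabilities, then either take the argmax or replace it with a temperature-controlled softmax) can be sketched in a few lines of plain Python. This is an illustration only, not the implementation from either paper: the function names `sample_gumbel`, `gumbel_max_sample`, and `gumbel_softmax_sample` and the example probabilities are assumptions made for this sketch, and the class probabilities are assumed to be strictly positive.

```python
import math
import random


def sample_gumbel(rng):
    """Draw standard Gumbel noise as -log(-log(u)) with u ~ Uniform(0, 1)."""
    # Clamp away from 0 so the inner log stays finite.
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def gumbel_max_sample(probs, rng):
    """Gumbel-Max trick: the index of the largest perturbed logit
    is a sample from the categorical distribution given by probs."""
    perturbed = [math.log(p) + sample_gumbel(rng) for p in probs]
    return max(range(len(probs)), key=lambda i: perturbed[i])


def gumbel_softmax_sample(probs, temperature, rng):
    """Relaxed sample: softmax of the perturbed logits divided by the temperature.
    The result is a soft vector on the simplex instead of a one-hot vector."""
    scaled = [(math.log(p) + sample_gumbel(rng)) / temperature for p in probs]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example values, chosen only for illustration.
rng = random.Random(0)
probs = [0.2, 0.3, 0.5]
print(gumbel_max_sample(probs, rng))           # a discrete class index
print(gumbel_softmax_sample(probs, 1.0, rng))  # a soft sample
print(gumbel_softmax_sample(probs, 0.1, rng))  # usually much closer to one-hot
```

Lower temperatures push the soft sample toward a one-hot vector, which matches the remark in the talk that a small temperature gets you very close to a quantized categorical sample. Plain Python has no automatic differentiation, so this sketch shows only the forward sampling step; in a real network the same arithmetic would be written in a framework that can backpropagate through it.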
You just drop it into your network, and you don’t really have to worry about changing your loss function or anything like that. The sensitive parameter here is the temperature, so choosing the right temperature is quite important. In the Concrete paper, they found that two-thirds works pretty well for the size of the classes that we were using, and another approach you can take is to simply start with a high temperature and gradually anneal it down, with some early stopping on the validation loss. Finally, we’ll be contributing a TensorFlow implementation to open source very soon, so please check that out, and check out both of our posters. Thank you very much. I’d be happy to take questions. And if Chris is in the audience, maybe he can come up here as well and answer questions with me. Thank you.

[APPLAUSE]

Hello, over on the side. So can you go back to the plot real quick, the one of the VAE? Yeah, sorry. Yeah, the plot, the chart. Anyway, if you quantize in the forward pass and then backprop through the relaxed objective, it looked like you had a bit of a performance regression. Do you notice that generally on other problems?

Yes. So you can think about this as a biased version of your biased estimator, so there is a performance gap from using what we call straight-through Gumbel-Softmax.

Question: I was curious if you have experimented with using this for a binary autoencoder.

A binary autoencoder? Binary like a variational autoencoder, or...

Basically, simply a binary autoencoder. I have my analog signal, and I want to code it in binary.

Right. No, so in the Concrete paper, we simply looked at the binary [variational] autoencoder, which just amounts to having an extra penalty on the loss, but not simply the autoencoder.

So you... [APPLAUSE]
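
The annealing approach mentioned in the takeaways (start with a high temperature, gradually lower it, and stop early based on validation loss) could look roughly like the sketch below. The talk does not specify a schedule, so the exponential decay, the floor of 0.5, the decay rate, and the patience rule are all assumptions chosen for illustration. `train_step` and `validation_loss` are hypothetical callbacks standing in for a real training loop.

```python
import math


def annealed_temperature(step, start=1.0, floor=0.5, rate=1e-4):
    """Exponentially decay the temperature from `start`, never below `floor`.
    The schedule and its default values are assumptions for this sketch."""
    return max(floor, start * math.exp(-rate * step))


def anneal_with_early_stopping(train_step, validation_loss, max_steps=10000,
                               eval_every=500, patience=3):
    """Run training while annealing the temperature, and stop once the
    validation loss has failed to improve for `patience` evaluations.
    Returns the temperature with the best validation loss, and that loss."""
    best_loss = float("inf")
    best_temperature = None
    bad_evals = 0
    for step in range(max_steps):
        temperature = annealed_temperature(step)
        train_step(temperature)  # hypothetical: one optimizer update
        if (step + 1) % eval_every == 0:
            loss = validation_loss(temperature)  # hypothetical: held-out loss
            if loss < best_loss:
                best_loss, best_temperature, bad_evals = loss, temperature, 0
            else:
                bad_evals += 1
                if bad_evals >= patience:
                    break
    return best_temperature, best_loss
```

In practice `train_step` would draw relaxed samples at the given temperature and apply a gradient update, and `validation_loss` would typically evaluate the model with discrete samples, as described in the talk.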