Transcript:
Hello everyone! Today we're looking at "RoBERTa: A Robustly Optimized BERT Pretraining Approach" by Yinhan Liu et al., mainly out of Facebook AI Research. This is a pretty short, pretty simple paper, and the main premise is this: we've seen a number of improvements over the initial BERT paper, where different pre-training schemes for the transformer architecture, or extensions of the architecture, have been shown to give better performance than the original BERT model. This paper basically says that if you get the design choices right, BERT is able to be on par with or exceed all of these other methods so far. So they're essentially exploring design choices in the pre-training and training of BERT.

If you don't know what BERT is, by the way, I have made a video about BERT, and I've also made a video about Transformers. In very quick terms, BERT is a neural network architecture for language that takes text as input, such as the kind of thing you see here, encodes it, and can then do various things with it, for example classify it into certain categories, segment it, or extract answers to questions. The whole thing is pre-trained with what's called a masked language modeling objective, which doesn't need labels. In a masked language modeling objective, you mask out certain words during training and then ask BERT to reconstruct these words from the surrounding information. That gave some improvements in the original BERT paper, but subsequent papers, such as XLNet, have claimed that you can improve even more by using different pre-training objectives and so on.

Here, the researchers explore different design choices while using the regular BERT architecture. That's what they describe here: they use both the 12-layer BERT-base as well as the 24-layer BERT-large as originally described, and they use masked language modeling as the pre-training objective. They also explore the necessity of the next sentence prediction (NSP) loss that has been part of BERT. Along with masked language modeling, BERT has also had an objective where you input two pieces of text, two sentences, and BERT has to decide whether the second sentence follows the first sentence in the corpus; in 50% of the cases, the second sentence is instead sampled from a different document. The original paper argued this is necessary to incorporate long-distance relationships in text: the NSP objective was designed to improve performance on downstream tasks such as natural language inference, and this paper explores the necessity of that loss.

In terms of optimization, there is of course a pre-training and training scheme using Adam with certain parameters, and this paper also explores those parameters. Lastly, there is the data. These models are often trained on different data, and that makes them harder to compare, because the pre-training is done on differently sized and differently structured corpora. This paper also investigates the influence of the training data, and especially what happens if you keep the training data constant. Alright, so they re-implement BERT.
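To make the masked language modeling objective a bit more concrete, here is a minimal sketch of how inputs get corrupted for it. This is my own illustration, not the paper's code: the toy vocabulary and whitespace tokenization are placeholders, but the 15% selection rate and the 80/10/10 replacement rule follow the original BERT recipe.

```python
# Minimal sketch of BERT-style masked language modeling input corruption.
# Illustration only: the vocabulary and tokenization are toy placeholders.
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                      # the model must reconstruct this token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK_TOKEN)           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))    # 10%: replace with a random token
            else:
                inputs.append(tok)                  # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)                     # no loss on unmasked positions
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens("the cat sat on the mat".split(), vocab, seed=0))
```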
They then fix some hyperparameters while they tune others. First of all, the datasets. The original BERT has been trained on the BookCorpus plus English Wikipedia, which is 16 GB of text. This paper additionally collects the CC-News dataset, the English portion of the CommonCrawl News dataset, which is 76 GB, on par with, for example, what GPT-2 used, I believe. So this is a very large training set, and comparing the original data to this large corpus should make very clear what the influence of more pre-training data is. They also have a number of other corpora, OpenWebText as well as, I believe, one more, Stories. These are also pretty sizable, but they have fairly specific flavors to them.

The evaluation happens on several downstream tasks. The idea is that you first pre-train this BERT model with masked language modeling and so on, and then you have the GLUE benchmark, which is actually a collection of nine tasks, and some other tasks such as SQuAD, which is a question answering task, and RACE, which I don't know in detail. Suffice to say these are downstream NLP tasks. The paper isn't about these downstream tasks; they are just a way to measure how well your pre-training worked: if you can then fine-tune on such a task and get good performance, the pre-training was good. What the tasks are in particular isn't too important.

Alright, so here we get into the meat of the paper. First, they decide on what they call static versus dynamic masking. In the original BERT paper, whenever they do masked language modeling, they take a piece of text and replicate it a bunch of times, because they want to iterate through the training data a bunch of times, and in each replicated copy they mask out different tokens. They compare this to what's called dynamic masking. Dynamic masking is where you generate your mask on the fly each time; you don't pre-compute it and save it. This allows you to go through as much or as little of the data as you want, and when you encounter the same sample twice, you see a different mask. With static masking, even though the data is replicated in the original BERT model, if you train for longer than the number of replications, you see the exact same mask again. With dynamic masking, each time you see a sample you generate the mask on the fly. They compare this, and they see a marginal improvement (higher is better) in two tasks and a marginal decrease in performance in one task, so they decide that dynamic masking is of use.

The second thing they investigate is the input format and this next sentence prediction. As I already said, the original BERT training objective always gets two pieces of text next to each other and has to decide if the second one follows the first one. To be precise, it observes two concatenated document segments, which are either sampled contiguously from the same document or from distinct documents.
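As a rough sketch of the static-versus-dynamic distinction (my own toy code, not the paper's implementation): static masking pre-computes a fixed number of masked copies, so masks repeat once training runs past the duplication factor, while dynamic masking draws a fresh mask every time a sequence is served to the model.

```python
# Toy contrast between static and dynamic masking. Function names, the
# duplication factor, and the mask helper are my own placeholders.
import random

def random_mask(tokens, rng, mask_prob=0.15):
    return [("[MASK]" if rng.random() < mask_prob else t) for t in tokens]

def static_masking(dataset, num_duplicates=10, seed=0):
    rng = random.Random(seed)
    # Masks are computed once; cycling past the cache repeats the same masks.
    cached = [random_mask(seq, rng) for seq in dataset for _ in range(num_duplicates)]
    while True:
        yield from cached

def dynamic_masking(dataset, seed=0):
    rng = random.Random(seed)
    # A fresh mask is drawn every time a sequence is served.
    while True:
        for seq in dataset:
            yield random_mask(seq, rng)

data = ["the cat sat on the mat".split()]
gen = dynamic_masking(data)
print(next(gen))
print(next(gen))  # same sentence, different masking
```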
And this is half-and-half. So in addition to the masked language modeling, the model is trained to predict whether the observed document segments come from the same or distinct documents via an auxiliary next sentence prediction (NSP) loss, and they investigate different ways of including or excluding this loss.

If a setting is marked "+NSP", that means it includes the next sentence (or next segment) prediction loss. First they have SEGMENT-PAIR+NSP, which means each input has a pair of segments. The distinction between a segment and a sentence is important here: a sentence is a natural sentence, while a segment can actually be multiple natural sentences, which is what the original BERT does, as long as the combined length is less than 512 tokens. So there are clearly two segments, and you have to decide whether they follow after each other or not. The second thing they try, SENTENCE-PAIR+NSP, is the same next segment prediction, but now it's just two natural sentences: one sentence, a period, then the next sentence, a period, and you have to decide whether they follow each other or not.

Then they investigate FULL-SENTENCES, where they leave away this next segment prediction loss and simply fill up the 512 tokens with text from the corpus: each input is packed with full sentences sampled contiguously from one or more documents. The "one or more documents" means that if you sample text and reach the end of a document, you simply continue with the next one and go on until you have your 512 tokens, so you fill and fill until you have 512 tokens. In the last variant, called DOC-SENTENCES, you do the same thing, but you stop at the end of the document, so even if you are not yet at 512 tokens, you stop there and are content with padding the rest or adjusting the batch size. You don't get as much data into one sample, but all the text in one sample is contiguous text from the same document.

So they pit these four settings against each other; that's this table here. As you can see, the best setting is DOC-SENTENCES, followed by FULL-SENTENCES. There are some ambiguities, but in general you can roughly rank them as best, second best, third best, and fourth best, and they conclude that this next segment or next sentence prediction loss is more hurtful than helpful. And even though DOC-SENTENCES is the most effective, they'd rather go with FULL-SENTENCES, because it's, well, I guess easier to implement, you get more data through the model in the same time, and the performance decrease isn't that much. It's pretty interesting to see that this next sentence/segment prediction isn't super helpful in actuality: removing the NSP loss matches or slightly improves the downstream task performance. This is in contrast to what the original BERT authors found, but you have to keep in mind that this setup also has a bunch of other changes in it. The next thing they investigate is batch size.
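Here is a rough sketch of what FULL-SENTENCES-style packing could look like. This is my own toy code under assumed whitespace tokenization and a hypothetical separator marker, not the paper's implementation: sentences are packed contiguously up to 512 tokens, crossing document boundaries, whereas DOC-SENTENCES would stop at each document boundary instead.

```python
# Sketch of FULL-SENTENCES-style packing: fill inputs to max_len tokens,
# continuing across document boundaries with a separator in between.
SEP = "[SEP]"
MAX_LEN = 512

def pack_full_sentences(documents, max_len=MAX_LEN):
    """documents: list of documents, each a list of sentence strings."""
    buffer, packed = [], []
    for i, doc in enumerate(documents):
        if i > 0 and buffer:
            buffer.append(SEP)               # mark a crossed document boundary
        for sentence in doc:
            buffer.extend(sentence.split())  # toy whitespace "tokenizer"
            if len(buffer) >= max_len:
                packed.append(buffer[:max_len])
                buffer = buffer[max_len:]
    if buffer:
        packed.append(buffer)                # last, possibly shorter, input
    return packed

docs = [["first doc sentence one .", "first doc sentence two ."],
        ["second doc starts here ."]]
for inp in pack_full_sentences(docs, max_len=8):
    print(inp)
```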
Batch size seems to be pretty interesting for these large models, in that they love large batch sizes, and they explore batch sizes going up to 8K. They do this in a data-parallel way, where there are many machines with many GPUs: they parallelize over the data and then accumulate the gradients from all of these different samples, and so they can go up to a batch size of about 8K. They generally find that the 2K batch size (perplexity: lower is better; the other numbers: higher is better) helps to improve performance if you control for dataset size, so the number of times you go through the dataset is the same, but going through it with a larger batch size seems to help, up to a point: 2K seems to be the best they found. So again, a marginal improvement you can make by training with larger batch sizes.

The last thing they looked at is text encoding, so how you encode text. The choice here is basically between byte-pair encoding and WordPiece encoding, which determine how large your vocabulary is. As I understand it, they didn't find much of a difference between the different implementations of the text encoding; they decided to go with byte-level byte-pair encoding instead of WordPiece.

They combine all of this into RoBERTa, the robustly optimized BERT approach. RoBERTa is trained with dynamic masking, which they showed first, FULL-SENTENCES without the next segment prediction loss, large mini-batches, and a larger byte-level byte-pair encoding, as well as, of course, their collection of training data. Then they also investigate how long to pre-train. If you look at the original BERT models or the XLNet models and compare them to RoBERTa: with the original data, RoBERTa already beats BERT, yet it does not yet beat XLNet. If they add data, they get even better, mostly on par with XLNet. If they pre-train longer, they get even better still, and if they pre-train even longer, so that the number of steps matches the number of steps XLNet takes, with their additional data, then they outperform XLNet as well.

So this is kind of just an overview. They evaluate on other downstream tasks and basically show that in most of them they can reach or exceed state-of-the-art performance with their approach. In conclusion, they basically say, well, this shows that the gains these other models make, and the reasons why they make those gains, may be questionable: if you simply pre-train BERT in a better way, you can reach the same performances. So I think the end is not reached yet. Most of all, they publish their code and their data, I believe; I have not looked into this, but definitely check out their repository where this is implemented. It seems pretty straightforward. And that was it for me, bye-bye!
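As a side note on how such large effective batch sizes are typically reached in practice, here is a minimal gradient-accumulation sketch in PyTorch. This is not the paper's training setup: the toy model, the fake data, and the accumulation factor are placeholders chosen only to show the accumulate-then-step pattern.

```python
# Simulating a large effective batch size via gradient accumulation.
# The paper parallelizes across many GPUs; this single-device toy example
# only illustrates the principle of accumulating gradients before one update.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                        # stand-in for a real encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

micro_batch, accum_steps = 32, 8                # effective batch = 32 * 8 = 256

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(micro_batch, 16)            # fake inputs
    y = torch.randint(0, 2, (micro_batch,))     # fake labels
    loss = loss_fn(model(x), y) / accum_steps   # scale so gradients average
    loss.backward()                             # gradients accumulate in .grad
optimizer.step()                                # one update with the large batch
```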