Transcript:
Hi everyone, my name is Manuel Pariente, and today I'm going to present our Interspeech paper about Asteroid, the PyTorch-based audio source separation toolkit for researchers. I'll first explain the motivations behind the creation of Asteroid, then its features, and I'll finally present the results we reproduced with it. This paper is joint work with great people from all over the world, and I want to thank them all for it.

So, what is audio source separation? Audio source separation aims to extract the source signals of interest from an audio stream involving several sound sources. It has been tackled for decades using probabilistic modeling, non-negative matrix factorization, or beamforming, and open-source toolkits such as FASST, ManyEars, or openBliSSART have supported that work. In recent years, deep learning approaches have been shown to be largely superior, at least for single-channel source separation. This is a hand-picked list of source separation papers since 2016 that all use deep learning. It shows that this is a really fast-evolving field and that, even if this is a small list, there are a lot of papers. Most of these papers adopt the encoder-masker-decoder approach, which was first introduced in the TasNet paper in 2017. While Asteroid is not limited to it, this is our starting point.

Several open-source datasets have also been introduced over those years to support research and comparability. Here is a table summarizing the main ones. The underlying tasks include speech enhancement, single-channel and multi-channel speech separation, and music and ambient sound separation, for example. We believe that an open-source toolkit providing standardized implementations for dataset generation on one side, and architectures and evaluation metrics on the other, would highly benefit the community by promoting reproducibility and offering a basis for easy experimentation and improvement.

If we look at past and present open-source software, both general-purpose and domain-specific packages have had a huge impact on research and industry, speeding up innovation. We would like something similar for source separation, but there are already great resources, as you will see, so why would we need something new? Both nussl and ONSSEN, for example, provide training and evaluation, but no data preparation, and they are not configurable enough from the command line. Open-Unmix, on the other hand, has all the ingredients we would like, but only for music source separation. And isolated implementations are, well, isolated, so we lose time jumping from one to another and adapting to code we don't know. Obviously, we encourage you to try them out and build your own opinion about them.

So, seeing these limitations of current software, our design principles are the following: use as much native Python code as possible, and integrate with existing code bases without changes; both of these are here to make it easy for newcomers to start using Asteroid. Another design principle is to provide all steps from data preparation to evaluation, to promote reproducible research. And finally, recipes should be configurable from the command line, to enable parallelization of experiments over GPU clusters. Okay, now I'll review Asteroid's current features before presenting the results we reproduced.
So, as we said, our starting point is the encoder-masker-decoder framework. Filterbanks are the main elements of the encoder and decoder blocks, and we chose to make them consistent with nn.Conv1d, the 1D convolution layer from PyTorch, so that users will already be familiar with them. On the other side, we have the masker architectures, which do the actual separation, and we support quite a wide range of those from the literature.

Whenever the sources to be separated are of the same nature, for example separating speech from speech, we need to use permutation invariant training. The naive approach computes pairwise losses N factorial times; with a simple memoization trick we can reduce this to N squared loss computations, and the Hungarian algorithm can speed things up even more. The PIT loss wrapper can turn any simple loss into a permutation-invariant one in two or three lines. It's really simple, you should check it out. On top of this wrapper, we support loss functions that are commonly used in source separation or speech enhancement.

Datasets are also a central part of source separation experimentation. We provide data preparation recipes and data loaders for the most common source separation and speech enhancement datasets, and we are trying to expand this coverage even more. Regarding the recipes, they follow a common pattern, which you can see on the right: we download the data, we prepare it for you, the model is trained, we perform evaluation, and then we can share the models and use the inference engine too. We will see that a little bit later. For training, we use a thin wrapper around PyTorch Lightning, so we can benefit from all its cool features such as distributed training, mixed-precision training, and so on.

Let's jump to a notebook to see how easy it can be to set up an experiment with Asteroid. We have the latest version of Asteroid installed. Let's now define our data, model, optimizer, and loss function to start a deep learning experiment. For fast experimentation, we set up a mini speech separation dataset. We can define Conv-TasNet as our model and use a classic optimizer. We use our PIT loss wrapper to compute permutation-invariant losses. Our System class pulls it all together, so we can train with PyTorch Lightning and all its cool features. Of course, we showed a condensed version here, but all levels of abstraction are exposed to the user, which makes it suitable for research.

We also created a model hub on Zenodo, to which pre-trained models can easily be uploaded after training, and from which they can be downloaded for further use by the community. You can copy-paste the name of a model and load it directly in Python. The loading looks like this: we use DPRNNTasNet.from_pretrained with the name of the model we found on Zenodo, and that's it, we have the model.
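As a minimal sketch of that loading step, the call might look like the snippet below; the model identifier here is only an illustrative placeholder, and in practice you would paste the actual name copied from the model page.

    # Minimal sketch: loading a pre-trained Asteroid model by name.
    # The identifier below is only an example placeholder, not a claim about
    # which models are actually published.
    from asteroid.models import DPRNNTasNet

    model = DPRNNTasNet.from_pretrained("EXAMPLE_USER/EXAMPLE_DPRNNTasNet_sep_clean")
    # The result is a regular PyTorch module, so a mixture tensor of shape
    # (batch, time) can be passed through it to obtain the separated sources.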
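And to make the condensed notebook walkthrough above a bit more concrete, here is a rough, self-contained sketch of how such an experiment could be wired together. It is not the exact notebook code: random tensors stand in for the mini speech separation dataset, and the pieces are simply combined in the order described above.

    # Rough sketch of the condensed experiment: model, optimizer, PIT loss,
    # System, and training with PyTorch Lightning. Random tensors stand in for
    # a real speech separation dataset.
    import torch
    import pytorch_lightning as pl
    from torch.optim import Adam
    from torch.utils.data import DataLoader, TensorDataset

    from asteroid.models import ConvTasNet
    from asteroid.losses import PITLossWrapper, pairwise_neg_sisdr
    from asteroid.engine.system import System

    # Dummy data: 1-second mixtures at 8 kHz and the two corresponding sources.
    mixtures, sources = torch.randn(8, 8000), torch.randn(8, 2, 8000)
    train_loader = DataLoader(TensorDataset(mixtures, sources), batch_size=4)
    val_loader = DataLoader(TensorDataset(mixtures, sources), batch_size=4)

    model = ConvTasNet(n_src=2)                    # two-speaker separation
    optimizer = Adam(model.parameters(), lr=1e-3)  # a classic optimizer
    # Wrap a pairwise negative SI-SDR loss to make it permutation invariant.
    loss_func = PITLossWrapper(pairwise_neg_sisdr, pit_from="pw_mtx")

    # The System class pulls everything together for PyTorch Lightning.
    system = System(model=model, optimizer=optimizer, loss_func=loss_func,
                    train_loader=train_loader, val_loader=val_loader)
    pl.Trainer(max_epochs=1).fit(system)

In a real experiment, the dataset classes and data loaders provided by Asteroid's recipes would replace the random tensors used here.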
We also have an inference command-line interface that downloads the required model and performs inference on the provided wav files, and you can use regexes as well. Okay, now that we've seen what Asteroid provides and what we can do with it, let's have a look at the results we obtained when replicating some important papers in the field.

This first comparison is done on the seminal wsj0-2mix dataset, where we train six different models, evaluate the performance in terms of scale-invariant signal-to-distortion ratio improvement (SI-SDRi), and compare with the numbers originally reported in the papers. We can see that in most cases we can reproduce the results, and in some cases improve on the original performance. The second comparison shows an improved TasNet model applied to the reverberant WHAM dataset, where we systematically outperform the original results while trying to match the experimental conditions as closely as we can. To illustrate the simplicity of experimentation, you can see the command we use to train the four models: it is just a simple loop in bash, where we loop over the tasks and keep the model fixed.

Let's wrap it up now. We've seen that audio source separation is a very active field, and with deep learning, since at least 2016, it's moving really fast. We introduced Asteroid, which is meant to support research and lower the barrier to entry in the field. We provide all the necessary building blocks for easy experimentation, along with recipes to reproduce important papers. We've seen that the results of these recipes at least match the results originally reported, so you're safe to use it. Regarding what's next, we're planning to refactor the recipes to enable mixing and matching of datasets and architectures through a powerful command-line interface. We'll also focus on making Asteroid models JIT-able for faster inference, on broader separation tasks, and more. We're also planning a recipe with an interface to ESPnet, or to the upcoming version of Kaldi, so that we can bridge source separation and recognition. To conclude, we'd like to invite you to join us and make Asteroid better and more suited to your use case. Our team is very inclusive and will be happy to guide you through your contributions. All right, that's it for me. See you next time.