[MUSIC] Thanks for watching Henry AI Labs. This video will explain adversarial audio synthesis: using generative adversarial networks to produce audio data rather than the heavily studied image data. This is done using the WaveGAN architecture and the SpecGAN model. The motivation is that generative modeling is taking off and becoming much more popular in deep learning research. It describes a set of techniques that take a dataset and learn to produce new samples that might belong to that dataset. So in this example from the BigGAN model, they have a dataset of dog images, and the model learns to produce novel dogs that align with the dataset but aren't too similar to any existing sample. This is really amazing, giving deep learning and artificial intelligence the power to create data. Generative adversarial networks are the most dominant way of doing this right now, and as a quick explanation of how this works: there is a generator network that learns an upsampling procedure from random noise into images, and then the discriminator learns to tell whether an image belongs to the dataset or was created by the generator. In this way they optimize each other, until the generator is eventually able to produce images that resemble the dataset. This technique of GANs has been enormously successful with images. It was originally developed using multi-layer perceptrons by Ian Goodfellow in 2014, then improved with transposed convolutional layers by Alec Radford and others, and then it took off with things like self-attention and spectral normalization. Now we have state-of-the-art results like the BigGAN model, which uses self-attention, spectral normalization, class-conditional projection, and tons of parameters, and StyleGAN, which uses progressive growing and an interesting technique with conditional batch normalization. So, transitioning to audio data.
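The generator-versus-discriminator game described above is usually written as the minimax objective from Goodfellow's 2014 paper; this is the standard formulation, not something stated explicitly in the video:

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\!\left[\log\!\big(1 - D(G(z))\big)\right]
```

Here the discriminator $D$ tries to maximize its ability to separate real samples $x$ from generated samples $G(z)$, while the generator $G$ tries to minimize that same quantity by making $G(z)$ indistinguishable from the data.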
The first question we should try to figure out is: what is audio data, as data? As data scientists, we're interested in the structure of data and how it's stored: what are the dimensions of the matrices, vectors, and tensors that hold our data? An image is represented as a height by width by channels tensor, or if it's grayscale, it could just be a matrix. In these image matrices, each pixel takes on one of 256 values; that's 8-bit color. In audio, what you have is a flat time-series vector with 44,100 samples per second at CD quality, but we're not going to operate at that quality; it's going to be about 16,000 samples per second in this paper. Each sample also has a much larger range of values compared to images, since you usually use 16 bits to represent audio compared to 8 bits per pixel channel. Additionally, audio is very different from image data in its inherent structure. It's really cyclical, because it essentially consists of a bunch of sine waves, compared to images, which have a global relationship but aren't quite as sequence-aligned. Showing this further is a principal component analysis of audio versus image data: the principal components of image data usually look like edge features, and it's kind of hard to make sense of them, but the principal components of the audio data show these cyclic patterns; you see cycles in each of the principal components. The DCGAN was an enormous step forward for applying generative adversarial networks to image data, and its generator is pictured here. The idea behind the DCGAN is that you take your random noise input vector and upsample it using transposed convolutions.
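To make the "audio as data" point concrete, here is a minimal sketch of a one-second clip at the paper's 16 kHz sample rate, stored as 16-bit samples; the 440 Hz sine and all variable names are illustrative choices of mine, not code from the paper:

```python
import numpy as np

# A 1-second clip at 16 kHz is a flat time-series vector of 16,000 samples.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate        # time axis for 1 second
wave = 0.5 * np.sin(2 * np.pi * 440.0 * t)      # a 440 Hz sine in [-1, 1]

# 16-bit audio: each sample is one of 2**16 = 65,536 integer levels,
# versus 256 levels per channel for 8-bit image pixels.
pcm16 = np.round(wave * 32767).astype(np.int16)

print(wave.shape)    # (16000,) -- a flat vector, not a height x width matrix
print(pcm16.dtype)   # int16
```

Compare this single flat vector to a 64 × 64 RGB image tensor of shape `(64, 64, 3)`: the audio clip has no spatial axes at all, just time.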
Transposed convolutions look like this: you start with a dense feature map, say four by four, spread it out, and then convolve over it to upsample the spatial resolution from four by four to eight by eight to sixteen by sixteen to thirty-two by thirty-two, up to the output target of a 64 by 64 RGB image. So in WaveGAN, this is the big idea, and it's actually a really simple idea: they use a similar transposed convolution operation, but theirs is one-dimensional. They take a series of sampled values from the waveform, which, compared to a two-dimensional feature map, is flat, and stretch it out into this kind of structure to do the upsampling convolution. This is the overall architecture: it takes in the 100-dimensional random vector and projects it up with a dense layer to 256 times d values, where d is the channel hyperparameter they use to tune their network, and then applies the one-dimensional transposed convolution a series of times until it ends up with the final target of about 16,000 samples, which is this parameter right here for the target output of the audio clip.
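The "spread it out, then convolve" step can be sketched as zero-insertion followed by an ordinary convolution. This is a simplified single-channel illustration of a 1-D transposed convolution with my own function and variable names, not the WaveGAN implementation:

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Minimal 1-D transposed convolution: insert (stride - 1) zeros
    between input samples ('stretch it out'), then convolve.
    Illustrative sketch only."""
    stretched = np.zeros(len(x) * stride)
    stretched[::stride] = x                  # spread the samples out
    return np.convolve(stretched, kernel, mode="same")

x = np.random.randn(16)                      # low-resolution feature map
kernel = np.random.randn(25)                 # WaveGAN favors long 1-D filters
y = transposed_conv1d(x, kernel, stride=4)
print(len(x), "->", len(y))                  # 16 -> 64: 4x upsampling
```

Stacking several of these stride-4 layers is how the generator grows a short latent feature map into the full-length waveform.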
So these are some of the hyperparameters used in WaveGAN, like the number of channels, batch size, the dimensionality parameter d that controls the size of the intermediate feature maps in the generator and discriminator, this phase shuffle thing, which we will discuss next, and then other things like the Wasserstein GAN loss, which will be covered in a future video on this channel. The authors don't describe using Bayesian optimization or some other tuning technique; they just give you these as a set of recommendations. Phase shuffle is one of the interesting techniques they present in the paper, and if you have better insight into it than I do, please share it in the comments, but the way I interpreted it, it's a technique to regularize the discriminator so that it doesn't just focus on really low-level details from the generator, like a certain sine wave being off by four frames, and use that to discriminate between generated and real audio samples. SpecGAN wasn't something I was that interested in, but basically, spectrograms are transformations, via Fourier transforms, into a time-frequency domain. They're like images that are really useful for classification tasks with audio and speech data, but they are difficult to invert, that is, to convert back to an audio waveform without losing a ton of information. They do present a technique in this paper to go from spectrograms back to waveforms, but I wasn't as interested in it. The dataset is simply the speech commands zero through nine, and the generated samples from the WaveGAN are classified correctly sixty-six percent of the time, showing that the WaveGAN has done a good job of capturing some of the semantics of the dataset. One of the interesting things to think about with these samples is the dimensionality.
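My reading of phase shuffle, sketched in code: shift a discriminator feature map by a small random offset and fill the gap by reflection at the boundary, so the discriminator can't key on exact sample alignment. This is my interpretation of the regularizer, with my own names, not the authors' code:

```python
import numpy as np

def phase_shuffle(features, n=2, rng=None):
    """Shift a 1-D feature map by a random offset in [-n, n],
    reflection-padding the gap so the length is preserved.
    Sketch of the phase-shuffle regularizer as I understand it."""
    rng = rng or np.random.default_rng()
    shift = int(rng.integers(-n, n + 1))
    if shift == 0:
        return features
    if shift > 0:
        # drop `shift` samples from the front, reflect-pad the end
        return np.concatenate([features[shift:], features[-2:-2 - shift:-1]])
    # shift < 0: reflect-pad the front, drop samples from the end
    return np.concatenate([features[-shift:0:-1], features[:shift]])

x = np.arange(16.0)
print(phase_shuffle(x, n=2).shape)   # always (16,), only the alignment moves
```

Applied at every discriminator layer with a fresh random offset, this makes "the sine wave is off by four frames" useless as a discrimination signal, which is exactly the failure mode described above.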
So at 16,000 samples per second, each data point is a vector with 16,000 dimensions, compared to something like MNIST, which is a 28 by 28 matrix. These are the results they present using the inception score, showing that their phase shuffle technique significantly improves performance compared to not using it. This is another funny results visualization they did: they played the bird vocalizations synthesized by the model to a cat and recorded how it responds to the different sounds. So now we're going to play the results, the audio samples that they host on their website. "Five. Six. Seven." Thank you for watching this explanation of adversarial audio synthesis and the WaveGAN architecture. Please leave any comments if you have additional insight into how these models work and the future of audio generation in general, and please subscribe to Henry AI Labs for more deep learning and artificial intelligence videos. Thanks for watching.
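For reference, the inception score used in those results is the exponentiated mean KL divergence between each sample's predicted class distribution and the marginal over all samples. This is the textbook formula computed from classifier probabilities, not the authors' evaluation code; the toy inputs are mine:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from per-sample class probabilities p(y|x):
    exp(mean KL(p(y|x) || p(y))). Higher means samples are both
    confidently classified and diverse across classes."""
    marginal = probs.mean(axis=0)                                    # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident predictions spread evenly over 10 classes -> score near 10
# (the maximum for 10 classes); uniform predictions -> score near 1.
confident = np.eye(10)[np.arange(100) % 10]
uniform = np.full((100, 10), 0.1)
print(inception_score(confident))
print(inception_score(uniform))
```

With ten spoken-digit classes, a score near 10 would mean every generated clip is confidently recognized as some digit and all digits appear; the paper's phase-shuffle versus no-phase-shuffle comparison is measured on this scale.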