Audio Conditioning for Music Generation via Discrete Bottleneck Features

Overview of the model

We present a MusicGen model that handles both style and text conditioning. The style conditioner takes an audio excerpt of a few seconds as a condition and lets the model generate music that is similar to it. The model is trained on 30-second segments, so we can sample excerpts up to 30 seconds long. The style conditioner can be used on its own or combined with the text conditioner. Here are some examples of music generated with the style conditioner only:

Style Condition Ex. 1 Ex. 2

We can also use it to generate variations of drum beats:

Style Condition Ex. 1 Ex. 2 Ex. 3 Ex. 4 Ex. 5

Thanks to Double Classifier Free Guidance (explained further down the page), we can mix textual and style conditions that are not aligned. All of the samples use \( \alpha=3 \) and \( \beta=6 \).

Textual description: "Rock opera with drums bass and an electric guitar. Epic feeling"

Style Condition Ex. 1 Ex. 2 Ex. 3 Ex. 4

Textual description: "Chill lofi remix"

Style Condition Ex. 1 Ex. 2 Ex. 3 Ex. 4

Textual description: "8-bit old video game music"

Style Condition Ex. 1 Ex. 2 Ex. 3 Ex. 4

Textual description: "Indian music with traditional instruments"

Style Condition Ex. 1 Ex. 2 Ex. 3 Ex. 4

Textual description: "80s New Wave with synthesizer"

Style Condition Ex. 1 Ex. 2 Ex. 3 Ex. 4 Ex. 5

Now, we provide some excerpts to illustrate the tables in the paper:

Comparison with baselines

Here are some samples from our internal test set (used for the human study of Tab.1):

Conditioning: the 3-second excerpt used to condition the model
Textual inversion: for this method we use a 30-second excerpt containing these 3 seconds to perform the inversion
MusicGen in continuation mode: given the conditioning, we use MusicGen to continue it without any textual description
CLAP conditioner: this model uses the CLAP embedding of the excerpt as conditioning
Our model with EnCodec as the feature extractor and a 2-level Residual Vector Quantization (RVQ)
Our model with MERT as the feature extractor and a 2-level Residual Vector Quantization (RVQ)

Conditioning	Textual Inversion	MusicGen Continuation	MusicGen w. CLAP conditioner	Our Model w. EnCodec and 2 RVQ	Our Model w. MERT and 2 RVQ

Influence of the quantization level

Here are some samples from our internal test set (used for the human study of Tab.2). We compare three quantization levels (q=1, q=2, q=4). The higher the level, the larger the bottleneck: as q increases, the generated music gets closer to the conditioning.

Conditioning: the 3-second excerpt used to condition the model
Our model with MERT as the feature extractor and a 1-level Residual Vector Quantization
Our model with MERT as the feature extractor and a 2-level Residual Vector Quantization
Our model with MERT as the feature extractor and a 4-level Residual Vector Quantization

Conditioning	Our Model w. MERT and 1 RVQ	Our Model w. MERT and 2 RVQ	Our Model w. MERT and 4 RVQ
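To make the role of q more concrete, here is a minimal sketch of residual vector quantization on a toy 2-D feature. It is only an illustration under stated assumptions: the codebooks are hand-made (every coordinate in {-scale, 0, +scale}, with the scale halved at each level), whereas the quantizer in the model is learned, and the helper names `make_codebook`, `rvq_encode` and `reconstruction_error` are hypothetical. Each level quantizes what the previous levels left unexplained, so keeping more levels lets more information about the excerpt pass through the bottleneck.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def make_codebook(scale: float, dim: int = 2) -> List[Vector]:
    """Toy codebook (assumption): all vectors with coordinates in {-scale, 0, +scale}."""
    return list(itertools.product((-scale, 0.0, scale), repeat=dim))


def rvq_encode(
    x: Sequence[float], codebooks: Sequence[Sequence[Vector]], q: int
) -> Tuple[List[int], List[float]]:
    """Quantize x with the first q codebooks; return (indices, reconstruction)."""
    residual = list(x)
    recon = [0.0] * len(x)
    indices = []
    for codebook in codebooks[:q]:
        # Pick the code closest (squared Euclidean distance) to the current residual.
        idx = min(
            range(len(codebook)),
            key=lambda i: sum((c - r) ** 2 for c, r in zip(codebook[i], residual)),
        )
        code = codebook[idx]
        indices.append(idx)
        # Add the code to the reconstruction and remove it from the residual.
        recon = [a + c for a, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, recon


def reconstruction_error(x: Sequence[float], recon: Sequence[float]) -> float:
    """Euclidean distance between the input and its quantized reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, recon)))


if __name__ == "__main__":
    # Four toy levels with scales 0.5, 0.25, 0.125, 0.0625 (made-up values).
    codebooks = [make_codebook(0.5 / 2 ** k) for k in range(4)]
    x = (0.7, -0.3)  # made-up feature vector
    for q in (1, 2, 4):
        indices, recon = rvq_encode(x, codebooks, q)
        print(q, indices, reconstruction_error(x, recon))
```

With these toy codebooks, the q indices are all that survives of the input, and the reconstruction error for this example shrinks as q goes from 1 to 2 to 4. That mirrors the trade-off heard in the samples: a small q keeps only coarse style information, while a larger q lets the generation stay closer to the conditioning.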

Double Classifier Free Guidance: merging text and style

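\( l_{\text{double CFG}} = l_{\emptyset} + \alpha [l_{\text{style}} + \beta(l_{\text{text, style}} - l_{\text{style}}) - l_{\emptyset}] \)

Here \( l_{\emptyset} \), \( l_{\text{style}} \) and \( l_{\text{text, style}} \) denote the logits predicted with no conditioning, with style conditioning only, and with both text and style conditioning.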

During training, the model sees aligned text and style conditioning (i.e. both the textual description and the style audio come from the same song). At inference time, however, we can pair a style excerpt with a different textual description in order to generate remixes.

Since the text conditioning is less informative than the style conditioning, we use double classifier free guidance (double CFG) to boost the text modality. We showcase the effectiveness of double CFG with a first example. The model used has a MERT feature extractor and a 2-level RVQ.

We let \( \alpha =3 \) and explore \( \beta \in \{1, 2, 3, 4, 5, 6, 7, 8, 9\} \). Note that \( \beta = 1 \) is equivalent to normal CFG.
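The double CFG formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the model's implementation: the function names `double_cfg` and `standard_cfg` are hypothetical, and the logits are made-up numbers over a 3-token vocabulary. The sketch also checks that \( \beta = 1 \) falls back to standard CFG on the text-and-style logits.

```python
from typing import List, Sequence


def double_cfg(
    l_uncond: Sequence[float],
    l_style: Sequence[float],
    l_text_style: Sequence[float],
    alpha: float = 3.0,
    beta: float = 6.0,
) -> List[float]:
    """Combine three logit vectors following the double CFG formula.

    l_uncond     : logits with no conditioning
    l_style      : logits with style conditioning only
    l_text_style : logits with both text and style conditioning
    """
    if not (len(l_uncond) == len(l_style) == len(l_text_style)):
        raise ValueError("all logit vectors must have the same length")
    out = []
    for u, s, ts in zip(l_uncond, l_style, l_text_style):
        # Inner step: move the style logits toward the text+style logits by beta.
        boosted = s + beta * (ts - s)
        # Outer step: ordinary CFG between unconditional and boosted logits.
        out.append(u + alpha * (boosted - u))
    return out


def standard_cfg(
    l_uncond: Sequence[float], l_cond: Sequence[float], alpha: float = 3.0
) -> List[float]:
    """Standard classifier free guidance with a single conditional branch."""
    return [u + alpha * (c - u) for u, c in zip(l_uncond, l_cond)]


if __name__ == "__main__":
    # Toy logits (made-up values): text only changes the second token.
    l_uncond = [0.0, 0.0, 0.0]
    l_style = [1.0, 0.0, -1.0]
    l_text_style = [1.0, 0.5, -1.0]
    for beta in (1.0, 3.0, 6.0):
        print(beta, double_cfg(l_uncond, l_style, l_text_style, alpha=3.0, beta=beta))
```

In this toy setting only the second token differs between the style and text-and-style logits, so increasing \( \beta \) amplifies exactly that token while leaving the others untouched. This is the sense in which \( \beta \) boosts the text modality relative to the style.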

Style Conditioning:

Textual Conditioning: "Hip-Hop Remix"

\( \beta = 1 \)	\( \beta = 2 \)	\( \beta = 3 \)	\( \beta = 4 \)	\( \beta = 5 \)

\( \beta = 6 \)	\( \beta = 7 \)	\( \beta = 8 \)	\( \beta = 9 \)
Influence of the \( \beta \) coefficient in double CFG

We notice that for low values of \( \beta \), there are almost no drums or bass (the textual description is ignored). For high values of \( \beta \), the textual description is favored but the quality of the music decreases.