Audio Conditioning for Music Generation via Discrete Bottleneck Features

Authors: Simon Rouard (1,2), Yossi Adi (1,3), Jade Copet (1), Axel Roebel (2), Alexandre Défossez (4)

Affiliations: 1FAIR Meta, 2IRCAM - Sorbonne Université, 3Hebrew University of Jerusalem, 4Kyutai

Accepted at ISMIR 2024: Read the paper

Code and a pretrained model are available in the audiocraft repository.
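As a quickstart, here is a minimal sketch of style-only generation, assuming the MusicGen-Style interface from the audiocraft repository; the checkpoint name facebook/musicgen-style and the input file style_prompt.wav are illustrative, not guaranteed:

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Checkpoint name is an assumption based on the audiocraft repository.
model = MusicGen.get_pretrained('facebook/musicgen-style')
model.set_generation_params(duration=30, cfg_coef=3.0)

# Style-only generation: condition on a short audio excerpt, no text prompt.
excerpt, sr = torchaudio.load('style_prompt.wav')  # hypothetical file
wav = model.generate_with_chroma(descriptions=[None],
                                 melody_wavs=excerpt,
                                 melody_sample_rate=sr)
audio_write('style_only', wav[0].cpu(), model.sample_rate, strategy='loudness')
```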


Overview of the model

We present a MusicGen model that handles both style and text conditioning. The style conditioner takes a few-second audio excerpt as a condition and lets the model generate music similar to it. The model is trained on 30-second segments, so we can generate excerpts up to 30 seconds long. The style conditioner can be used on its own or combined with the text conditioner. Here are some examples of music generated with the style conditioner only:
[Audio examples: style condition, Ex. 1, Ex. 2]
We can also use it to generate variations of drum beats:
[Audio examples: style condition, Ex. 1–5]
Thanks to double classifier free guidance (explained further down the page), we can mix textual and style conditions that are not aligned. All of the samples below use \( \alpha = 3 \) and \( \beta = 6 \).
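For reference, a hedged sketch of how such a text + style remix could be produced, again assuming the audiocraft interface; the cfg_coef / cfg_coef_beta parameter names and the input file are assumptions:

```python
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-style')
# cfg_coef plays the role of alpha and cfg_coef_beta the role of beta in the
# double CFG formula below; both names are assumed from the audiocraft repo.
model.set_generation_params(duration=30, cfg_coef=3.0, cfg_coef_beta=6.0)

excerpt, sr = torchaudio.load('style_prompt.wav')  # hypothetical file
wav = model.generate_with_chroma(descriptions=['Chill lofi remix'],
                                 melody_wavs=excerpt,
                                 melody_sample_rate=sr)
```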
Textual description: "Rock opera with drums bass and an electric guitar. Epic feeling"
[Audio examples: style condition, Ex. 1–4]
Textual description: "Chill lofi remix"
[Audio examples: style condition, Ex. 1–4]
Textual description: "8-bit old video game music"
[Audio examples: style condition, Ex. 1–4]
Textual description: "Indian music with traditional instruments"
[Audio examples: style condition, Ex. 1–4]
Textual description: "80s New Wave with synthesizer"
[Audio examples: style condition, Ex. 1–5]
Now, we provide some excerpts to illustrate the tables in the paper:

Comparison with baselines

Here are some samples from our internal test set (used for the human study of Tab. 1):

  • Conditioning: the 3-second excerpt used to condition the model
  • Textual inversion: for this method, we use a 30-second excerpt containing these 3 seconds to perform the inversion
  • MusicGen in continuation mode: given the conditioning, we use MusicGen to continue it without any textual description
  • CLAP conditioner: this model uses the CLAP embedding of the excerpt as conditioning
  • Our model with EnCodec as feature extractor and a 2-level residual vector quantizer (RVQ)
  • Our model with MERT as feature extractor and a 2-level residual vector quantizer (RVQ)

Comparison with baselines
[Audio table: Conditioning · Textual Inversion · MusicGen Continuation · MusicGen w. CLAP conditioner · Our Model w. EnCodec and 2 RVQ · Our Model w. MERT and 2 RVQ]


Influence of the quantization level

Here are some samples from our internal test set (used for the human study of Tab. 2). We compare three levels of quantization (q = 1, q = 2, q = 4). The larger q is, the wider the bottleneck, so more information from the conditioning passes through: as q increases, the generated music gets closer to the conditioning. A code sketch for sweeping q follows the list below.

  • Conditioning: the 3-second excerpt used to condition the model
  • Our model with MERT as feature extractor and a 1-level residual vector quantizer
  • Our model with MERT as feature extractor and a 2-level residual vector quantizer
  • Our model with MERT as feature extractor and a 4-level residual vector quantizer
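As referenced above, a minimal sketch of such a sweep over quantization levels, assuming the audiocraft interface; eval_q, excerpt_length, and the input file are assumptions, not confirmed names:

```python
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-style')
model.set_generation_params(duration=30, cfg_coef=3.0)
excerpt, sr = torchaudio.load('style_prompt.wav')  # hypothetical file

# eval_q would control how many RVQ levels of the style tokenizer are used at
# inference; excerpt_length, the seconds kept from the prompt. Both parameter
# names are assumed from the audiocraft repository.
for q in (1, 2, 4):
    model.set_style_conditioner_params(eval_q=q, excerpt_length=3.0)
    wav = model.generate_with_chroma(descriptions=[None],
                                     melody_wavs=excerpt,
                                     melody_sample_rate=sr)
```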



Influence of the quantization level
[Audio table: Conditioning · Our Model w. MERT and 1 RVQ · Our Model w. MERT and 2 RVQ · Our Model w. MERT and 4 RVQ]


Double Classifier Free Guidance: merging text and style



\( l_{\text{double CFG}} = l_{\emptyset} + \alpha \left[ l_{\text{style}} + \beta \left( l_{\text{text,style}} - l_{\text{style}} \right) - l_{\emptyset} \right] \)

At training time, the model is trained with aligned text and style conditioning (i.e., both the textual description and the style audio excerpt come from the same song). At inference time, however, we can mix in a different textual description in order to generate remixes.

Since the text conditioning is less informative than the style conditioning, we use double classifier free guidance (double CFG) to boost the text modality. We showcase the effectiveness of double CFG with a first example. The model used has a MERT feature extractor and 2 RVQ levels.

We set \( \alpha = 3 \) and explore \( \beta \in \{1, 2, 3, 4, 5, 6, 7, 8, 9\} \). Note that \( \beta = 1 \) is equivalent to standard CFG.
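To make the formula concrete, here is a minimal sketch of how the three estimates (null, style-only, text + style) could be combined at each decoding step; the tensor names are hypothetical, and this is not the audiocraft implementation:

```python
import torch

def double_cfg(l_null: torch.Tensor, l_style: torch.Tensor,
               l_text_style: torch.Tensor,
               alpha: float = 3.0, beta: float = 6.0) -> torch.Tensor:
    """Combine the three logit estimates following the formula above.

    With beta = 1, the inner term collapses to l_text_style and the
    expression reduces to standard CFG between the null condition and
    the full (text + style) condition.
    """
    inner = l_style + beta * (l_text_style - l_style)
    return l_null + alpha * (inner - l_null)
```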

Style Conditioning:

Textual Conditioning: "Hip-Hop Remix"

Influence of the \( \beta \) coefficient in double CFG
[Audio table: generations for \( \beta = 1 \) through \( \beta = 9 \)]
We notice that for low values of \( \beta \) there are almost no drums and bass (the textual description is ignored), while for high values of \( \beta \) the quality of the music decreases and the textual description is favored.