Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning

Zhang, Jisi; Parada, Pablo Peso; Jalal, Md Asif; Saravanan, Karthikeyan

arXiv:2501.14680 (eess)

[Submitted on 24 Jan 2025 (v1), last revised 27 Jan 2025 (this version, v2)]

Abstract: Diffusion-based Text-To-Music (TTM) models generate music that matches a text description. UNet-based diffusion models are typically conditioned on text embeddings from either a pre-trained large language model or a cross-modal audio-language representation model. This work proposes a diffusion-based TTM model in which the UNet is conditioned on both (i) a uni-modal language model (e.g., T5) via cross-attention and (ii) a cross-modal audio-language representation model (e.g., CLAP) via Feature-wise Linear Modulation (FiLM). The diffusion model is trained to exploit both a local text representation from T5 and a global representation from CLAP. Furthermore, we propose modifications that extract both global and local representations from T5 through two pooling mechanisms, mean pooling and self-attention pooling. This approach removes the need for an additional encoder (e.g., CLAP) to extract a global representation, thereby reducing the number of model parameters. Our results show that adding the CLAP global embeddings to the T5 local embeddings improves text adherence (KL=1.47) compared to a baseline model that relies solely on the T5 local embeddings (KL=1.54). Alternatively, extracting global text embeddings directly from the T5 local embeddings through the proposed mean pooling approach yields better generation quality (FAD=1.89) with marginally worse text adherence (KL=1.51) than the model conditioned on both CLAP and T5 text embeddings (FAD=1.94 and KL=1.47). The proposed solution is both efficient and compact in terms of the number of parameters required.
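
The conditioning ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mean_pool`, `self_attention_pool`, `film`), the single learned query vector used for self-attention pooling, and the plain linear maps that produce the FiLM scale and shift are all assumptions made for illustration, and the abstract does not specify these details. Plain Python lists stand in for tensors so the example has no dependencies.

```python
import math


def mean_pool(tokens, mask):
    """Average the embeddings of non-padding tokens (mask value 1)."""
    count = sum(mask)
    if count == 0:
        raise ValueError("mask selects no tokens")
    dim = len(tokens[0])
    return [sum(t[d] * m for t, m in zip(tokens, mask)) / count
            for d in range(dim)]


def self_attention_pool(tokens, mask, query):
    """Weighted average of token embeddings.

    Weights are a softmax over scaled dot products between each token and
    `query`, a hypothetical learned vector (an assumption of this sketch).
    Padding tokens (mask value 0) receive zero weight.
    """
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(t, query)) / scale for t in tokens]
    valid = [s for s, m in zip(scores, mask) if m]
    if not valid:
        raise ValueError("mask selects no tokens")
    top = max(valid)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    dim = len(tokens[0])
    return [sum((e / total) * t[d] for e, t in zip(exps, tokens))
            for d in range(dim)]


def film(features, global_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation: gamma[c] * x + beta[c] per channel.

    `features` is a list of channels, each a list of values. gamma and beta
    are linear functions of the global embedding; the weight matrices and
    biases are hypothetical parameters of this sketch.
    """
    def linear(weights, bias):
        return [sum(w * g for w, g in zip(row, global_emb)) + b
                for row, b in zip(weights, bias)]

    gamma = linear(w_gamma, b_gamma)
    beta = linear(w_beta, b_beta)
    return [[gamma[c] * x + beta[c] for x in channel]
            for c, channel in enumerate(features)]
```

In this sketch, a pooled vector from `mean_pool` or `self_attention_pool` plays the role of the global text embedding passed to `film`, while the unpooled token embeddings would feed cross-attention (not shown).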

Comments:	Accepted at ICASSP 2025
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2501.14680 [eess.AS]
	(or arXiv:2501.14680v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2501.14680

Submission history

From: Pablo Peso Parada
[v1] Fri, 24 Jan 2025 17:57:47 UTC (254 KB)
[v2] Mon, 27 Jan 2025 10:41:29 UTC (254 KB)
