ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

Bai, Yatong; Dang, Trung; Tran, Dung; Koishida, Kazuhito; Sojoudi, Somayeh

arXiv:2309.10740 (cs)

[Submitted on 19 Sep 2023 (v1), last revised 24 Jun 2024 (this version, v3)]

Abstract: Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference because each generation requires an excessive number of queries to the underlying denoising network. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve this by proposing a "CFG-aware latent consistency model," which adapts consistency generation into a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware metrics, such as CLAP score, to further enhance the generations. Our objective and subjective evaluation on the AudioCaps dataset shows that, compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.
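To make the query-count argument concrete, the following is a minimal sketch, not the authors' implementation. It shows one common parameterization of classifier-free guidance and counts network queries for a guided multi-step sampler versus a single-query generator. The function names, the guidance weight, and the 200-step baseline are illustrative assumptions; the abstract states only the single query and the 400x figure.

```python
def cfg_combine(cond, uncond, w):
    """One common CFG form: start at the unconditional prediction and
    move toward the conditional one, scaled by guidance weight w."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


def guided_diffusion_queries(num_steps):
    """A guided diffusion sampler queries the denoiser twice per step:
    once with the text prompt and once without it."""
    return 2 * num_steps


def single_query_generator_queries():
    """A one-step generator with guidance folded into training needs a
    single network query per generation, as the abstract describes."""
    return 1


# Assumed baseline: 200 sampling steps (hypothetical, not from the abstract).
baseline = guided_diffusion_queries(200)
ratio = baseline // single_query_generator_queries()
print(f"{baseline} queries vs 1 query: {ratio}x fewer network calls")
```

Under these assumptions the ratio comes out to 400, which matches the reduction reported in the abstract. The actual baseline configuration is described in the paper itself.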

Subjects:	Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2309.10740 [cs.SD]
	(or arXiv:2309.10740v3 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2309.10740

Submission history

From: Yatong Bai
[v1] Tue, 19 Sep 2023 16:36:33 UTC (136 KB)
[v2] Thu, 6 Jun 2024 01:32:00 UTC (10,619 KB)
[v3] Mon, 24 Jun 2024 06:51:55 UTC (21,198 KB)
