VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

Jung, Jaemin; Ahn, Junseok; Jung, Chaeyoung; Nguyen, Tan Dat; Jang, Youngjoon; Chung, Joon Son

arXiv:2412.19259 (eess)

[Submitted on 26 Dec 2024]

Abstract: We present VoiceDiT, a multi-modal generative model that produces environment-aware speech and audio from text and visual prompts. Aligning speech with text is crucial for intelligibility, yet achieving this alignment in noisy conditions remains a significant and underexplored challenge. To address this, we propose an audio generation pipeline with three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that bridges the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experiments demonstrate that VoiceDiT outperforms previous models on real-world datasets, with significant improvements in both audio quality and modality integration.

Comments:	Accepted to ICASSP 2025
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2412.19259 [eess.AS]
	(or arXiv:2412.19259v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2412.19259

Submission history

From: Jaemin Jung
[v1] Thu, 26 Dec 2024 15:52:58 UTC (174 KB)
