Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes

Ryu, Hyeonggon; Kim, Seongyu; Chung, Joon Son; Senocak, Arda

arXiv:2503.18880 (cs)

[Submitted on 24 Mar 2025]

Abstract: We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing the complexity of real-world audio sources that are often mixed. Our approach introduces a 'mix-and-separate' framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement using mixed audio. Through these objectives, our model learns to produce distinct embeddings for each audio type, enabling effective disentanglement and grounding across mixed audio sources. Additionally, we created a new dataset to evaluate simultaneous grounding of mixed audio sources, demonstrating that our model outperforms prior methods. Our approach also achieves comparable or better performance in standard segmentation and cross-modal retrieval tasks, highlighting the benefits of our mix-and-separate approach.
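
The abstract does not give the exact form of the objectives, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a symmetric InfoNCE contrastive loss for audio-visual alignment and a squared-cosine penalty for disentanglement. Both choices, along with every function name, the mixing ratio parameter `snr_db`, and the temperature value, are assumptions introduced here for illustration. The embeddings are plain NumPy arrays standing in for the outputs of hypothetical audio and visual encoders.

```python
import numpy as np

def mix_audio(speech, sound, snr_db=0.0):
    """Mix two equal-length waveforms at a speech-to-sound power ratio in dB (assumed setup)."""
    speech_power = np.mean(speech ** 2) + 1e-8
    sound_power = np.mean(sound ** 2) + 1e-8
    # Scale the sound so that speech_power / scaled_sound_power == 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (sound_power * 10 ** (snr_db / 10)))
    return speech + scale * sound

def l2_normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def info_nce(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of audio_emb is the positive for row i of visual_emb."""
    logits = l2_normalize(audio_emb) @ l2_normalize(visual_emb).T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))

def mix_and_separate_loss(speech_emb, sound_emb, visual_for_speech, visual_for_sound):
    """Hypothetical combined objective on two embeddings extracted from one mixed audio clip."""
    # Correspondence: each audio-type embedding should match its own visual target.
    align = info_nce(speech_emb, visual_for_speech) + info_nce(sound_emb, visual_for_sound)
    # Disentanglement (assumed form): push the two embeddings of the same mix apart.
    cos = np.sum(l2_normalize(speech_emb) * l2_normalize(sound_emb), axis=1)
    return align + np.mean(cos ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mixed = mix_audio(rng.normal(size=16000), rng.normal(size=16000), snr_db=0.0)
    # Random stand-ins for encoder outputs: batch of 4, embedding dimension 8.
    s, n, vs, vn = (rng.normal(size=(4, 8)) for _ in range(4))
    print(mixed.shape, mix_and_separate_loss(s, n, vs, vn))
```

In this sketch the same mixed clip would be fed to an encoder with two output heads, one per audio type. The alignment terms tie each head to its visual counterpart, while the penalty term discourages the heads from collapsing onto the same representation.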

Comments:	CVPR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2503.18880 [cs.CV]
	(or arXiv:2503.18880v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.18880

Submission history

From: Arda Senocak
[v1] Mon, 24 Mar 2025 16:56:04 UTC (3,801 KB)