Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation

Rahimi, Akam; Afouras, Triantafyllos; Zisserman, Andrew

arXiv:2501.01518 (eess)

[Submitted on 2 Jan 2025]

Abstract: The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end, we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence, alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.

Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
Cite as:	arXiv:2501.01518 [eess.AS]
	(or arXiv:2501.01518v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2501.01518
Journal reference:	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Submission history

From: Akam Rahimi
[v1] Thu, 2 Jan 2025 19:53:25 UTC (1,743 KB)
