Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

Ephrat, Ariel; Mosseri, Inbar; Lang, Oran; Dekel, Tali; Wilson, Kevin; Hassidim, Avinatan; Freeman, William T.; Rubinstein, Michael

doi:10.1145/3197517.3201357

arXiv:1804.03619 (cs)

[Submitted on 10 Apr 2018 (v1), last revised 9 Aug 2018 (this version, v2)]

Abstract: We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprising thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, requiring only that the user specify the face of the person in the video whose speech they want to isolate. Our method shows a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (requiring a separate model to be trained for each speaker of interest).

Comments:	Accepted to SIGGRAPH 2018.
Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:1804.03619 [cs.SD]
	(or arXiv:1804.03619v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.1804.03619
Journal reference:	ACM Trans. Graph. 37(4): 112:1-112:11 (2018)
Related DOI:	https://doi.org/10.1145/3197517.3201357

Submission history

From: Ariel Ephrat
[v1] Tue, 10 Apr 2018 16:28:59 UTC (5,279 KB)
[v2] Thu, 9 Aug 2018 21:22:37 UTC (9,019 KB)
