Weakly-supervised Audio-visual Sound Source Detection and Separation

Rahman, Tanzila; Sigal, Leonid

arXiv:2104.02606 (cs)

[Submitted on 25 Mar 2021]

Abstract: Learning how to localize and separate individual object sounds in the audio channel of a video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, an approach known as the Mix-and-Separate framework. We propose an audio-visual co-segmentation approach, in which the network learns both what individual objects look like and what they sound like, from videos annotated only with object labels. Unlike other recent visually guided audio source separation frameworks, our architecture can be trained end-to-end and requires no additional supervision or bounding box proposals. Specifically, we introduce weakly-supervised object segmentation in the context of sound separation. We also formulate spectrogram mask prediction using a set of learned mask bases, which are combined using coefficients conditioned on the output of object segmentation, a design that facilitates separation. Extensive experiments on the MUSIC dataset show that our proposed approach outperforms state-of-the-art methods on visually guided sound source separation and sound denoising.
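
The mask-basis idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the number of bases, the nonlinearity, or how the coefficients are produced. The sketch below assumes a sigmoid over a weighted sum of bases, uses fixed numbers in place of coefficients that the paper conditions on object segmentation, and approximates the Mix-and-Separate input by summing two magnitude spectrograms. All function names are hypothetical.

```python
import math
import random


def mix_spectrograms(spec_a, spec_b):
    """Assumed Mix-and-Separate input: element-wise sum of two magnitude spectrograms."""
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(spec_a, spec_b)]


def combine_mask_bases(bases, coeffs):
    """Weighted sum of K mask bases, squashed by a sigmoid (an assumption) into (0, 1)."""
    n_freq, n_time = len(bases[0]), len(bases[0][0])
    mask = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            logit = sum(c * basis[f][t] for c, basis in zip(coeffs, bases))
            row.append(1.0 / (1.0 + math.exp(-logit)))
        mask.append(row)
    return mask


def apply_mask(mixture, mask):
    """Element-wise product of a mask and the mixture spectrogram."""
    return [[m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture)]


random.seed(0)
n_freq, n_time, n_bases = 4, 5, 3

# Toy magnitude spectrograms standing in for two single-source videos.
spec_a = [[random.random() for _ in range(n_time)] for _ in range(n_freq)]
spec_b = [[random.random() for _ in range(n_time)] for _ in range(n_freq)]
mixture = mix_spectrograms(spec_a, spec_b)

# Random stand-ins for learned mask bases (a network would produce these).
bases = [[[random.gauss(0.0, 1.0) for _ in range(n_time)] for _ in range(n_freq)]
         for _ in range(n_bases)]

# Hypothetical coefficients; in the paper they depend on the object segmentation output.
coeffs = [0.8, -0.3, 1.2]

mask = combine_mask_bases(bases, coeffs)
separated = apply_mask(mixture, mask)
print(len(separated), len(separated[0]))
```

With all coefficients set to zero, every mask entry is 0.5, so the separated output is half of the mixture. This makes it easy to see that the coefficients alone decide how the shared bases are weighted for each object.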

Comments:	4 figures, 6 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Cite as:	arXiv:2104.02606 [cs.CV]
	(or arXiv:2104.02606v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2104.02606
Journal reference:	IEEE International Conference on Multimedia and Expo (ICME) 2021

Submission history

From: Tanzila Rahman
[v1] Thu, 25 Mar 2021 10:17:55 UTC (5,270 KB)
