Localizing Visual Sounds the Hard Way

Chen, Honglie; Xie, Weidi; Afouras, Triantafyllos; Nagrani, Arsha; Vedaldi, Andrea; Zisserman, Andrew

arXiv:2104.02691 (cs)

[Submitted on 6 Apr 2021]

Abstract: The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism that automatically mines hard samples and adds them to a contrastive learning formulation. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently introduced VGG-Sound dataset, in which the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, unlike Flickr SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves state-of-the-art performance against several baselines.
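
The abstract's central idea, mining hard samples inside images that do contain the sounding object and feeding them to a contrastive objective, can be illustrated with a small sketch. This is not the authors' implementation: the function name `hard_way_loss`, the input layout of `sim`, and the threshold and temperature defaults are assumptions chosen for illustration. The sketch treats high-response locations of the paired image as positives, low-response locations of the same image as hard negatives, and all locations of the other images in the batch as easy negatives.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def masked_mean(values, weights):
    """Weighted mean of values; returns 0.0 when the weights sum to ~0."""
    total = sum(weights)
    if total < 1e-8:
        return 0.0
    return sum(v * w for v, w in zip(values, weights)) / total


def hard_way_loss(sim, eps_pos=0.65, eps_neg=0.4, tau=0.03):
    """Contrastive loss with hard negatives mined inside the paired image.

    sim[i][j] is a flat list of cosine similarities between audio clip i
    and every spatial location of image j (hypothetical input layout).
    eps_pos, eps_neg and tau are illustrative defaults, not tuned values.
    """
    batch = len(sim)
    loss = 0.0
    for i in range(batch):
        s_ii = sim[i][i]
        # Soft mask of locations likely to contain the sounding object.
        pos_mask = [sigmoid((s - eps_pos) / tau) for s in s_ii]
        # Soft mask of low-response locations in the SAME image: hard negatives.
        neg_mask = [1.0 - sigmoid((s - eps_neg) / tau) for s in s_ii]
        pos = masked_mean(s_ii, pos_mask)
        hard_neg = masked_mean(s_ii, neg_mask)
        # Easy negatives: mean response of audio i on every other image.
        easy_neg = [sum(sim[i][j]) / len(sim[i][j])
                    for j in range(batch) if j != i]
        neg_terms = [hard_neg] + easy_neg
        denom = math.exp(pos) + sum(math.exp(n) for n in neg_terms)
        loss += -math.log(math.exp(pos) / denom)
    return loss / batch


# Toy batch of two audio-image pairs with four spatial locations each.
peaked = [0.9, 0.9, 0.1, 0.1]
flat = [0.0, 0.0, 0.0, 0.0]
aligned = [[peaked, flat], [flat, peaked]]      # audio i responds to image i
misaligned = [[flat, peaked], [peaked, flat]]   # audio i responds to the wrong image
aligned_loss = hard_way_loss(aligned)
misaligned_loss = hard_way_loss(misaligned)
```

In this toy batch the aligned pairing yields the lower loss, which is the behavior a localization objective of this kind should reward. The soft sigmoid masks keep the region selection differentiable, so in a real training setup the same computation could be expressed with tensor operations and backpropagated through.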

Comments:	CVPR 2021
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Cite as:	arXiv:2104.02691 [cs.CV]
	(or arXiv:2104.02691v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2104.02691

Submission history

From: Honglie Chen
[v1] Tue, 6 Apr 2021 17:38:18 UTC (13,710 KB)
