Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Liu, Tianyu; Zhang, Peng; Huang, Wei; Zha, Yufei; You, Tao; Zhang, Yanning

doi:10.1145/3581783.3612502

arXiv:2308.04767 (cs)

[Submitted on 9 Aug 2023]

Abstract: Self-supervised sound source localization is usually challenged by modality inconsistency. In recent studies, contrastive learning-based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenes. However, insufficient attention to the influence of heterogeneity across the features of the different modalities still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments conducted on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at this https URL
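
For readers unfamiliar with the contrastive formulation the abstract builds on, the following is a minimal sketch of a generic audio-visual contrastive objective, not the Induction Network itself. It scores an audio embedding against every spatial visual feature to form a localization map, pools that map into one audio-image score, and applies an InfoNCE-style loss over a batch. All function names, the max-pooling choice, the temperature value, and the toy data are assumptions made for illustration; the Induction Vector, gradient decoupling, visual weighting, and adaptive threshold selection are not implemented here, since the abstract does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def localization_map(audio_vec, visual_grid):
    """Similarity of one audio embedding to every spatial visual feature."""
    return [[cosine(audio_vec, feat) for feat in row] for row in visual_grid]

def pair_score(audio_vec, visual_grid):
    """Max-pool the map into one audio-image score (an assumed pooling choice)."""
    return max(max(row) for row in localization_map(audio_vec, visual_grid))

def info_nce(audio_batch, visual_batch, temperature=0.1):
    """Mean InfoNCE loss; the i-th audio and i-th image form the positive pair."""
    losses = []
    for i, audio_vec in enumerate(audio_batch):
        logits = [pair_score(audio_vec, grid) / temperature for grid in visual_batch]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Hypothetical toy data: two audio embeddings and two 2x2 visual feature grids.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
visual_batch = [
    [[[1.0, 0.0], [0.0, -1.0]], [[-1.0, 0.0], [0.0, -1.0]]],  # source at cell (0, 0)
    [[[-1.0, 0.0], [0.0, -1.0]], [[-1.0, 0.0], [0.0, 1.0]]],  # source at cell (1, 1)
]

heatmap = localization_map(audio_batch[0], visual_batch[0])
matched_loss = info_nce(audio_batch, visual_batch)
mismatched_loss = info_nce(audio_batch, visual_batch[::-1])
```

In this toy setup the heatmap peaks at the cell whose feature matches the audio embedding, and the loss is much lower when audio and images are paired correctly than when the image order is swapped. That behaviour is what makes the objective usable as a self-supervised localization signal.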

Comments:	Accepted to ACM Multimedia 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2308.04767 [cs.CV]
	(or arXiv:2308.04767v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.04767
Related DOI:	https://doi.org/10.1145/3581783.3612502

Submission history

From: Tianyu Liu
[v1] Wed, 9 Aug 2023 07:55:12 UTC (1,131 KB)
