Audio-Visual Collaborative Representation Learning for Dynamic Saliency Prediction

Ning, Hailong; Zhao, Bin; Hu, Zhanxuan; He, Lang; Pei, Ercheng

arXiv:2109.08371 (cs)

[Submitted on 17 Sep 2021 (v1), last revised 2 May 2022 (this version, v3)]

Abstract: The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism to perceive dynamic scenes, which is important in many vision tasks. Most existing methods consider only visual cues and neglect the accompanying audio information, which can provide complementary information for scene understanding. In fact, there is a strong relation between auditory and visual cues, and humans generally perceive the surrounding scene by sensing these cues collaboratively. Motivated by this, an audio-visual collaborative representation learning method is proposed for the DSP task; it explores the audio modality to assist the visual modality in predicting the dynamic saliency map. The proposed method consists of three parts: 1) audio-visual encoding, 2) audio-visual location, and 3) collaborative integration. First, a refined SoundNet architecture is adopted to encode the audio modality into features, and a modified 3D ResNet-50 architecture is employed to learn visual features that contain both spatial location and temporal motion information. Second, an audio-visual location part is devised to locate the sound source in the visual scene by learning the correspondence between audio and visual information. Third, a collaborative integration part is devised to adaptively aggregate audio-visual information and a center-bias prior to generate the final saliency map. Extensive experiments are conducted on six challenging audio-visual eye-tracking datasets (DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD), and the results show significant superiority over state-of-the-art DSP models.
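
The second and third parts of the pipeline can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how correspondence is measured or how the aggregation is learned, so the sketch assumes cosine similarity for sound-source location, a fixed Gaussian for the center-bias prior, and hand-picked fusion weights in place of the adaptive aggregation. All function names, feature values, and weights below are hypothetical, and the encoders (SoundNet, 3D ResNet-50) are replaced by tiny hand-written feature vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def localize_sound_source(visual_feats, audio_vec):
    """Score each spatial cell by how well its visual feature matches the audio.

    visual_feats: H x W grid of C-dimensional feature vectors (nested lists).
    audio_vec:    one C-dimensional audio feature vector.
    Returns an H x W map with values rescaled from [-1, 1] to [0, 1].
    Assumption: cosine similarity stands in for the learned correspondence.
    """
    return [[(cosine_similarity(cell, audio_vec) + 1.0) / 2.0 for cell in row]
            for row in visual_feats]


def center_bias(height, width, sigma=0.5):
    """Isotropic Gaussian prior peaking at the image center (coords in [-1, 1])."""
    def coord(i, n):
        return 0.0 if n == 1 else 2.0 * i / (n - 1) - 1.0

    return [[math.exp(-(coord(r, height) ** 2 + coord(c, width) ** 2)
                      / (2 * sigma ** 2))
             for c in range(width)] for r in range(height)]


def integrate(maps, weights):
    """Weighted sum of same-sized maps, rescaled so the maximum is 1.

    Assumption: fixed weights replace the adaptive (learned) aggregation.
    """
    h, w = len(maps[0]), len(maps[0][0])
    fused = [[sum(wt * m[r][c] for wt, m in zip(weights, maps))
              for c in range(w)] for r in range(h)]
    peak = max(max(row) for row in fused)
    return [[v / peak for v in row] for row in fused] if peak > 0 else fused


# Toy 3 x 3 scene with 2-dimensional features; only the top-right cell
# has a visual feature aligned with the audio feature.
audio = [1.0, 0.0]
visual = [[[0.0, 1.0] for _ in range(3)] for _ in range(3)]
visual[0][2] = [1.0, 0.0]

av_map = localize_sound_source(visual, audio)
visual_saliency = [[0.2, 0.2, 0.2],
                   [0.2, 0.6, 0.2],
                   [0.2, 0.2, 0.2]]  # made-up visual-only saliency
prior = center_bias(3, 3)
saliency = integrate([visual_saliency, av_map, prior], weights=[0.4, 0.4, 0.2])
```

In this toy scene the fused map still peaks at the center, where the visual-only saliency and the prior agree, but the top-right cell that matches the audio scores higher than the other three corners. That is the intended effect of letting the audio modality assist the visual one.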

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2109.08371 [cs.CV]
	(or arXiv:2109.08371v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2109.08371

Submission history

From: Hailong Ning
[v1] Fri, 17 Sep 2021 06:49:43 UTC (1,744 KB)
[v2] Wed, 27 Apr 2022 03:14:32 UTC (1 KB) (withdrawn)
[v3] Mon, 2 May 2022 01:12:04 UTC (1,276 KB)
