Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction

Yuan, Yuan; Ning, Hailong; Zhao, Bin

arXiv:2109.08371v1 (cs)

A newer version of this paper has been withdrawn by Hailong Ning

[Submitted on 17 Sep 2021 (this version), latest version 2 May 2022 (v3)]

Abstract: Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive a scene, a capability that is important in many vision tasks. Most existing methods consider only visual cues and neglect the accompanying audio information, which can provide complementary information for scene understanding. In fact, auditory and visual cues are strongly related, and humans generally perceive the surrounding scene by sensing both simultaneously. Motivated by this, a bio-inspired audio-visual cues integration method is proposed for the VAP task; it uses the audio modality to assist the vision modality and thereby better predict the visual attention map. The proposed method consists of three parts: 1) audio-visual encoding, 2) audio-visual location, and 3) multi-cues aggregation. Firstly, a refined SoundNet architecture is adopted to encode the audio modality into corresponding features, and a modified 3D ResNet-50 architecture is employed to learn visual features that contain both spatial location and temporal motion information. Secondly, an audio-visual location part is devised to locate the sound source in the visual scene by learning the correspondence between audio and visual information. Thirdly, a multi-cues aggregation part is devised to adaptively aggregate audio-visual information and a center-bias prior to generate the final visual attention map. Extensive experiments are conducted on six challenging audiovisual eye-tracking datasets (DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD), and the results show significant superiority over state-of-the-art visual attention models.
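The abstract gives no equations or implementation details, so the following is only an illustrative sketch of how the second and third parts might fit together, not the authors' method. It assumes that sound-source location can be approximated by cosine similarity between one audio feature vector and each spatial position of a visual feature map, that the center-bias prior is an isotropic Gaussian, and that the adaptive aggregation is a softmax-weighted sum. All function names, shapes, and parameters (`localize_sound`, `center_bias`, `aggregate`, `sigma`, `logits`) are hypothetical, and random arrays stand in for the SoundNet and 3D ResNet-50 features.

```python
import numpy as np


def localize_sound(visual_feat, audio_feat):
    """Cosine similarity between an audio vector (C,) and each spatial
    position of a visual feature map (C, H, W), rescaled to [0, 1].
    Assumption: a stand-in for the learned audio-visual correspondence."""
    v = visual_feat / (np.linalg.norm(visual_feat, axis=0, keepdims=True) + 1e-8)
    a = audio_feat / (np.linalg.norm(audio_feat) + 1e-8)
    sim = np.einsum("chw,c->hw", v, a)  # values in [-1, 1]
    return (sim + 1.0) / 2.0


def center_bias(h, w, sigma=0.25):
    """Isotropic Gaussian prior peaking at the frame center.
    Assumption: the abstract does not specify the form of the prior."""
    ys = np.linspace(-0.5, 0.5, h)[:, None]
    xs = np.linspace(-0.5, 0.5, w)[None, :]
    return np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))


def aggregate(maps, logits):
    """Softmax-weighted sum of cue maps, min-max normalized to [0, 1].
    Assumption: in a trained model the logits would be learned or predicted
    per input; here they are fixed numbers."""
    weights = np.exp(logits - np.max(logits))
    weights = weights / weights.sum()
    fused = sum(wt * m for wt, m in zip(weights, maps))
    fused = fused - fused.min()
    return fused / (fused.max() + 1e-8)


# Usage with random placeholder features (16 channels, 7x7 spatial grid).
rng = np.random.default_rng(0)
visual_feat = rng.standard_normal((16, 7, 7))
audio_feat = rng.standard_normal(16)

visual_map = np.abs(visual_feat).mean(axis=0)  # placeholder visual cue map
visual_map = visual_map / visual_map.max()
av_map = localize_sound(visual_feat, audio_feat)
prior = center_bias(7, 7)

attention = aggregate([visual_map, av_map, prior], np.array([0.5, 0.3, 0.2]))
print(attention.shape)
```

Rescaling each cue map to a common range before fusing keeps a single cue from dominating purely because of its scale; the relative influence of each cue is then controlled by the weights alone.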

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2109.08371 [cs.CV]
	(or arXiv:2109.08371v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2109.08371

Submission history

From: Hailong Ning
[v1] Fri, 17 Sep 2021 06:49:43 UTC (1,744 KB)
[v2] Wed, 27 Apr 2022 03:14:32 UTC (1 KB) (withdrawn)
[v3] Mon, 2 May 2022 01:12:04 UTC (1,276 KB)
