Dynamic Cross Attention for Audio-Visual Person Verification

Praveen, R. Gnana; Alam, Jahangir

arXiv:2403.04661 (cs)

[Submitted on 7 Mar 2024 (v1), last revised 22 Apr 2024 (this version, v3)]

Abstract: Although person or identity verification has been predominantly explored using individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they may not always strongly complement each other; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross-Attention (DCA) model that can dynamically select the cross-attended or unattended features on the fly based on the strong or weak complementary relationships, respectively, across audio and visual modalities. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and choose cross-attended features only when they exhibit strong complementary relationships, and unattended features otherwise. Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves the performance on multiple variants of cross-attention while outperforming the state-of-the-art methods.
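
The gating idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the form of the conditional gating layer, so everything here is an assumption made for illustration: the two-way softmax gate, the temperature value, the hypothetical `gate_weights` matrix (learned in a real model, fixed by hand here), and the function names. It shows only the general pattern of weighting unattended against cross-attended features for one modality, not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gate_features(unattended, attended, gate_weights, temperature=0.1):
    """Blend unattended and cross-attended features with a two-way soft gate.

    gate_weights is a hypothetical 2 x D matrix (an assumption, not from the
    paper): row 0 scores the unattended path, row 1 the cross-attended path.
    A low temperature pushes the gate toward a near-hard selection.
    """
    # One logit per path, computed from the cross-attended features.
    logits = [sum(w * x for w, x in zip(row, attended)) for row in gate_weights]
    g_unatt, g_att = softmax(logits, temperature)
    # Weighted combination of the two candidate feature vectors.
    fused = [g_unatt * u + g_att * a for u, a in zip(unattended, attended)]
    return fused, (g_unatt, g_att)


# Toy feature vectors for a single modality (values are made up).
unattended = [1.0, 0.0, 2.0]
attended = [0.5, 1.0, 1.5]

# Hand-picked weights that favor the cross-attended path.
gate_weights = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
fused, gates = gate_features(unattended, attended, gate_weights)

# Swapping the rows favors the unattended path instead.
fused_rev, gates_rev = gate_features(unattended, attended, gate_weights[::-1])
```

With the low temperature, the gate behaves almost like a switch: in the first call `fused` stays close to `attended`, and in the second it stays close to `unattended`, which mirrors the "select on the fly" behavior the abstract describes.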

Comments:	Accepted to FG2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2403.04661 [cs.CV]
	(or arXiv:2403.04661v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.04661

Submission history

From: Rajasekar Gnana Praveen
[v1] Thu, 7 Mar 2024 17:07:51 UTC (214 KB)
[v2] Tue, 12 Mar 2024 20:52:02 UTC (214 KB)
[v3] Mon, 22 Apr 2024 14:04:55 UTC (1,195 KB)
