Audio-Visual Transformer Based Crowd Counting

Sajid, Usman; Chen, Xiangyu; Sajid, Hasan; Kim, Taejoon; Wang, Guanghui

Computer Science > Computer Vision and Pattern Recognition

arXiv:2109.01926 (cs)

[Submitted on 4 Sep 2021]


Abstract: Crowd estimation is a very challenging problem. The most recent study tries to exploit auditory information to aid the visual models; however, its performance is limited by the lack of an effective approach for feature extraction and integration. This paper proposes a new audio-visual multi-task network that addresses the critical challenges in crowd counting by effectively utilizing both visual and audio inputs for better modality association and productive feature extraction. The proposed network introduces the notion of auxiliary and explicit image patch-importance ranking (PIR) and patch-wise crowd estimate (PCE) information to produce a third (run-time) modality. These modalities (audio, visual, run-time) undergo a transformer-inspired cross-modality co-attention mechanism to finally output the crowd estimate. To acquire rich visual features, we propose a multi-branch structure with transformer-style fusion in between. Extensive experimental evaluations show that the proposed scheme outperforms state-of-the-art networks under all evaluation settings, with up to 33.8% improvement. We also analyze and compare the vision-only variant of our network and empirically demonstrate its superiority over previous approaches.
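The abstract names a transformer-inspired cross-modality co-attention mechanism but gives no implementation details. As a rough illustration of the general idea only, the sketch below shows plain scaled dot-product co-attention between two token sequences in standard-library Python. It is not the authors' architecture: the function names, the toy token values, the restriction to two modalities, and the omission of learned query/key/value projections are all assumptions made for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over another sequence.

    Assumption: no learned projections; tokens are used directly as Q, K, V.
    """
    d = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


def co_attention(modality_a, modality_b):
    """Symmetric co-attention: A attends to B, and B attends to A."""
    a_from_b = cross_attention(modality_a, modality_b, modality_b)
    b_from_a = cross_attention(modality_b, modality_a, modality_a)
    return a_from_b, b_from_a


# Toy 2-dimensional tokens (hypothetical values, not taken from the paper).
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[0.5, 0.5], [1.0, -1.0]]

visual_ctx, audio_ctx = co_attention(visual_tokens, audio_tokens)
```

Each visual token comes back as a convex combination of the audio tokens, and vice versa, so the output sequences keep the length of the querying modality. A third modality, such as the run-time one described in the abstract, could presumably be handled by repeating the same pairing, though the abstract does not say how the three are combined.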

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2109.01926 [cs.CV]
	(or arXiv:2109.01926v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2109.01926

Submission history

From: Usman Sajid
[v1] Sat, 4 Sep 2021 20:25:35 UTC (3,654 KB)
