Efficient Vision Transformer for Human Pose Estimation via Patch Selection

Kinfu, Kaleab A.; Vidal, Rene

Computer Science > Computer Vision and Pattern Recognition

arXiv:2306.04225v2 (cs)

[Submitted on 7 Jun 2023 (v1), last revised 22 Nov 2023 (this version, v2)]

Abstract: While Convolutional Neural Networks (CNNs) have been widely successful in 2D human pose estimation, Vision Transformers (ViTs) have emerged as a promising alternative to CNNs, boosting state-of-the-art performance. However, the quadratic computational complexity of ViTs has limited their applicability for processing high-resolution images. In this paper, we propose three methods for reducing ViT's computational complexity, which are based on selecting and processing a small number of the most informative patches while disregarding others. The first two methods leverage a lightweight pose estimation network to guide the patch selection process, while the third method utilizes a set of learnable joint tokens to ensure that the selected patches contain the most important information about body joints. Experiments across six benchmarks show that our proposed methods achieve a significant reduction in computational complexity, ranging from 30% to 44%, with only a minimal drop in accuracy between 0% and 3.5%.
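
The general idea in the abstract, keeping only the most informative patches so that self-attention runs over fewer tokens, can be illustrated with a toy sketch. This is not the authors' implementation: the scores below are made up, and the helper names `select_top_k_patches` and `attention_pair_count` are hypothetical. In the paper the scores would come from a lightweight pose network or learnable joint tokens; here they are simply given as a list.

```python
# Illustrative sketch only: scores and helper names are hypothetical,
# not the selection networks described in the paper.

def select_top_k_patches(scores, k):
    """Return the indices of the k highest-scoring patches, in original order."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of patches")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def attention_pair_count(num_tokens):
    """Query-key pairs in one self-attention layer, which grows quadratically."""
    return num_tokens * num_tokens


# Toy example: 8 patches with made-up informativeness scores.
scores = [0.05, 0.90, 0.10, 0.75, 0.02, 0.60, 0.08, 0.40]
kept = select_top_k_patches(scores, k=4)

full_cost = attention_pair_count(len(scores))
reduced_cost = attention_pair_count(len(kept))

print(kept)                      # indices of the retained patches
print(reduced_cost / full_cost)  # fraction of attention pairs remaining
```

Keeping half of the tokens leaves a quarter of the attention pairs in this toy count. The figure covers only the attention term of a single layer, so it is not comparable to the 30% to 44% overall reduction reported in the abstract.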

Comments:	BMVC 2023 Oral Paper
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2306.04225 [cs.CV]
	(or arXiv:2306.04225v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.04225
Journal reference:	34th British Machine Vision Conference 2023

Submission history

From: Kaleab A. Kinfu
[v1] Wed, 7 Jun 2023 08:02:17 UTC (3,540 KB)
[v2] Wed, 22 Nov 2023 12:35:08 UTC (2,286 KB)
