AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

Wang, Yulin; Yue, Yang; Lin, Yuanze; Jiang, Haojun; Lai, Zihang; Kulikov, Victor; Orlov, Nikita; Shi, Humphrey; Huang, Gao

arXiv:2112.14238 (cs)

[Submitted on 28 Dec 2021 (v1), last revised 12 Apr 2022 (this version, v2)]

Abstract: Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing spatial redundancy. As a representative work, the adaptive focus method (AdaFocus) has achieved a favorable trade-off between accuracy and inference speed by dynamically identifying and attending to the informative regions in each video frame. However, AdaFocus requires a complicated three-stage training pipeline (involving reinforcement learning), which leads to slow convergence and is unfriendly to practitioners. This work reformulates the training of AdaFocus as a simple one-stage algorithm by introducing a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization. We further present an improved training scheme to address the issues introduced by the one-stage formulation, including the lack of supervision, input diversity, and training stability. Moreover, a conditional-exit technique is proposed to perform temporally adaptive computation on top of AdaFocus without additional training. Extensive experiments on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, and Jester) demonstrate that our model significantly outperforms the original AdaFocus and other competitive baselines, while being considerably simpler and more efficient to train. Code is available at this https URL.
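The abstract does not spell out the interpolation-based patch selection, so the following is only a minimal sketch of the general idea, assuming plain bilinear interpolation on a single-channel frame. The names `bilinear_sample` and `crop_patch` are hypothetical and are not taken from the authors' code. The point it illustrates is that when a patch is cropped at a continuous center location, every output pixel is a smooth function of that center, which is what would let gradients reach the module that predicts the location.

```python
import math


def bilinear_sample(frame, y, x):
    """Sample a 2D grid (list of rows, at least 2x2) at a real-valued (y, x)."""
    h, w = len(frame), len(frame[0])
    # Clamp the query point so all four neighbours lie inside the frame.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0 = min(int(math.floor(y)), h - 2)
    x0 = min(int(math.floor(x)), w - 2)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * frame[y0][x0] + dx * frame[y0][x0 + 1]
    bottom = (1 - dx) * frame[y0 + 1][x0] + dx * frame[y0 + 1][x0 + 1]
    return (1 - dy) * top + dy * bottom


def crop_patch(frame, center_y, center_x, size):
    """Crop a size x size patch around a continuous centre (hypothetical helper)."""
    half = (size - 1) / 2.0
    return [
        [bilinear_sample(frame, center_y - half + i, center_x - half + j)
         for j in range(size)]
        for i in range(size)
    ]


# Toy frame whose intensity is a linear ramp: frame[i][j] = 10 * i + j.
frame = [[10.0 * i + j for j in range(6)] for i in range(6)]

patch = crop_patch(frame, 2.5, 2.5, 2)
shifted = crop_patch(frame, 2.5, 2.75, 2)

# Finite-difference sensitivity of one patch pixel to the centre's x coordinate.
sensitivity = (shifted[0][0] - patch[0][0]) / 0.25
```

On this ramp the sensitivity equals the horizontal slope of the frame, so a small shift of the center changes the patch by a proportional amount instead of jumping between integer crops. In an automatic-differentiation framework the same computation would yield gradients with respect to the center directly; the finite difference here only stands in for that.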

Comments:	Accepted by CVPR-2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2112.14238 [cs.CV]
	(or arXiv:2112.14238v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2112.14238

Submission history

From: Yulin Wang
[v1] Tue, 28 Dec 2021 17:53:38 UTC (3,829 KB)
[v2] Tue, 12 Apr 2022 02:44:14 UTC (3,828 KB)
