Transformers in Vision: A Survey

Khan, Salman; Naseer, Muzammal; Hayat, Munawar; Zamir, Syed Waqas; Khan, Fahad Shahbaz; Shah, Mubarak

doi:10.1145/3505244

arXiv:2101.01169 (cs)

[Submitted on 4 Jan 2021 (v1), last revised 19 Jan 2022 (this version, v5)]

Abstract: Astounding results from Transformer models on natural language tasks have prompted the vision community to study their application to computer vision problems. Among their salient benefits, Transformers can model long-range dependencies between input sequence elements and, unlike recurrent networks such as Long Short-Term Memory (LSTM), support parallel processing of sequences. Unlike convolutional networks, Transformers require minimal inductive biases in their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows multiple modalities (e.g., images, videos, text, and speech) to be processed using similar processing blocks, and it scales well to very large-capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline. We start with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of Transformers in vision, including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition and video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques in terms of both architectural design and experimental value. Finally, we provide an analysis of open research directions and possible future work.
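As a rough illustration of the self-attention concept and the set-function remark in the abstract, the following is a minimal sketch and not code from the survey. It assumes a single attention head, identity query/key/value projections (real Transformers use learned projection matrices), and no positional encoding. The names `softmax`, `self_attention`, `tokens`, and `perm` are hypothetical and chosen only for this example. Under these assumptions, every output token attends to every input token in one step, and permuting the inputs permutes the outputs in the same way.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    # Single-head scaled dot-product self-attention.
    # Assumption: queries, keys, and values are the inputs themselves
    # (identity projections); x is a list of n vectors of length d.
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Output is the attention-weighted average of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]  # an arbitrary reordering of the three tokens

out = self_attention(tokens)
out_perm = self_attention([tokens[i] for i in perm])

# Without positional encodings, reordering the inputs only reorders the outputs.
equivariant = all(
    math.isclose(a, b, rel_tol=1e-9)
    for j, i in enumerate(perm)
    for a, b in zip(out_perm[j], out[i])
)
print(equivariant)
```

The loop over `x` is written sequentially for readability; each output row depends only on the inputs, so the rows could be computed in parallel, which is the contrast with recurrent networks drawn in the abstract.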

Comments:	30 pages (Accepted in ACM Computing Surveys December 2021)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2101.01169 [cs.CV]
	(or arXiv:2101.01169v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2101.01169
Related DOI:	https://doi.org/10.1145/3505244

Submission history

From: Salman Khan Dr.
[v1] Mon, 4 Jan 2021 18:57:24 UTC (5,383 KB)
[v2] Mon, 22 Feb 2021 11:40:11 UTC (11,698 KB)
[v3] Wed, 8 Sep 2021 04:44:04 UTC (11,698 KB)
[v4] Sun, 3 Oct 2021 12:21:35 UTC (13,054 KB)
[v5] Wed, 19 Jan 2022 05:49:50 UTC (13,054 KB)
