Beyond Fixation: Dynamic Window Visual Transformer

Ren, Pengzhen; Li, Changlin; Wang, Guangrun; Xiao, Yun; Du, Qing; Liang, Xiaodan; Chang, Xiaojun

arXiv:2203.12856 (cs)

[Submitted on 24 Mar 2022 (v1), last revised 8 Apr 2022 (this version, v2)]

Abstract: Recent work on visual transformers has increasingly sought to reduce computational cost by restricting self-attention to local windows. Most of this work uses a fixed single-scale window by default and ignores the impact of window size on model performance, which may limit the ability of window-based models to capture multi-scale information. In this paper, we propose a novel method named Dynamic Window Vision Transformer (DW-ViT). The dynamic window strategy of DW-ViT goes beyond models that employ a single fixed window setting. To the best of our knowledge, we are the first to use dynamic multi-scale windows to explore the upper limit of the effect of window settings on model performance. In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention. The information is then dynamically fused by assigning different weights to the multi-scale window branches. We conducted a detailed performance evaluation on three datasets: ImageNet-1K, ADE20K, and COCO. Compared with related state-of-the-art (SoTA) methods, DW-ViT obtains the best performance. Specifically, compared with the current SoTA Swin Transformer (Liu et al., 2021), DW-ViT achieves consistent and substantial improvements on all three datasets with similar parameter counts and computational costs. In addition, DW-ViT exhibits good scalability and can be easily inserted into any window-based visual transformer.
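The two steps the abstract describes (a different window size per head group, followed by weighted fusion of the resulting branches) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one head per group, identity Q/K/V projections, non-overlapping windows, and a gate computed as a softmax over each branch's global mean activation; the abstract specifies none of these details. The function names (`window_attention`, `window_branch`, `dynamic_window_attention`) are hypothetical, and the code uses only the Python standard library so that it runs anywhere, whereas a practical version would use a tensor library.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(tokens):
    """Self-attention over the tokens of one window.

    Assumption: identity Q/K/V projections and a single head, for brevity.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        attn = softmax(scores)
        out.append([sum(a * tok[c] for a, tok in zip(attn, tokens)) for c in range(dim)])
    return out


def window_branch(fmap, win):
    """Run window_attention on each non-overlapping win x win window."""
    height, width = len(fmap), len(fmap[0])
    assert height % win == 0 and width % win == 0, "window must tile the map"
    out = [[None] * width for _ in range(height)]
    for top in range(0, height, win):
        for left in range(0, width, win):
            coords = [(i, j) for i in range(top, top + win)
                      for j in range(left, left + win)]
            attended = window_attention([fmap[i][j] for i, j in coords])
            for (i, j), vec in zip(coords, attended):
                out[i][j] = vec
    return out


def dynamic_window_attention(fmap, window_sizes):
    """Hypothetical sketch: one channel group per window size, then gated fusion."""
    height, width, channels = len(fmap), len(fmap[0]), len(fmap[0][0])
    groups = len(window_sizes)
    assert channels % groups == 0, "channels must split evenly across groups"
    dim = channels // groups

    branches = []
    for g, win in enumerate(window_sizes):
        # Each head group sees its own channel slice and its own window size.
        sub = [[px[g * dim:(g + 1) * dim] for px in row] for row in fmap]
        branches.append(window_branch(sub, win))

    # Assumed gate: softmax over each branch's global mean activation.
    logits = [sum(x for row in b for px in row for x in px) / (height * width * dim)
              for b in branches]
    weights = softmax(logits)

    # Scale each branch by its weight and concatenate along the channel axis.
    fused = [[[w * x for w, b in zip(weights, branches) for x in b[i][j]]
              for j in range(width)] for i in range(height)]
    return fused, weights


if __name__ == "__main__":
    # A 4 x 4 feature map with 4 channels, split across window sizes 2 and 4.
    fmap = [[[(i + j + c) / 10 for c in range(4)] for j in range(4)] for i in range(4)]
    fused, weights = dynamic_window_attention(fmap, window_sizes=[2, 4])
    print(len(fused), len(fused[0]), len(fused[0][0]), weights)
```

The fused output keeps the input's spatial size and channel count, and the branch weights sum to one, so under these assumptions the block could stand in for a single fixed-window attention layer without changing tensor shapes.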

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2203.12856 [cs.CV]
	(or arXiv:2203.12856v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2203.12856
Journal reference:	CVPR 2022

Submission history

From: Pengzhen Ren
[v1] Thu, 24 Mar 2022 05:38:07 UTC (2,496 KB)
[v2] Fri, 8 Apr 2022 06:24:01 UTC (2,498 KB)
