Rethinking Hierarchies in Pre-trained Plain Vision Transformer

Xu, Yufei; Zhang, Jing; Zhang, Qiming; Tao, Dacheng

arXiv:2211.01785 (cs)

[Submitted on 3 Nov 2022 (v1), last revised 8 Nov 2022 (this version, v2)]

Abstract: Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective. However, hierarchical ViTs require carefully designed custom algorithms, e.g., GreenMIM, and cannot use the simple vanilla MAE developed for the plain ViT. More importantly, since these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of plain ViTs, they must be pre-trained separately at massive computational cost, thereby incurring both algorithmic and computational complexity. In this paper, we address this problem by proposing a novel idea: disentangling the hierarchical architecture design from the self-supervised pre-training. We transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of the linear embedding layer from 16 to 4 and add convolution (or simple average) pooling layers between the transformer blocks, thereby reducing the feature size from 1/4 to 1/32 of the input resolution sequentially. Despite its simplicity, the resulting model outperforms the plain ViT baseline in classification, detection, and segmentation tasks on the ImageNet, MS COCO, Cityscapes, and ADE20K benchmarks, respectively. We hope this preliminary study will draw more attention from the community to developing effective (hierarchical) ViTs while avoiding the pre-training cost by leveraging off-the-shelf checkpoints. The code and models will be released at this https URL.
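
The resolution schedule described in the abstract can be illustrated with a short arithmetic sketch. This is not the authors' code: the function names `stage_grid_sizes` and `avg_pool_2x2` are hypothetical, and the sketch assumes three pooling layers that each halve the feature map (1/4, 1/8, 1/16, 1/32), which the abstract implies but does not state explicitly. The 224-pixel input is likewise only an example value.

```python
def stage_grid_sizes(image_size, embed_stride=4, num_pools=3, pool_factor=2):
    """Side length of the token grid at each stage.

    Assumption (not stated in the abstract): every pooling layer
    downsamples by the same factor of 2.
    """
    total_stride = embed_stride * pool_factor ** num_pools
    if image_size % total_stride != 0:
        raise ValueError("image_size must be divisible by the total stride")
    sizes = [image_size // embed_stride]  # after the stride-4 embedding
    for _ in range(num_pools):
        sizes.append(sizes[-1] // pool_factor)  # after each pooling layer
    return sizes


def avg_pool_2x2(grid):
    """Average non-overlapping 2x2 windows of a square 2D list.

    Stands in for the 'simple average' pooling option; the grid side
    length must be even.
    """
    n = len(grid)
    return [
        [
            (grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1]) / 4.0
            for j in range(0, n, 2)
        ]
        for i in range(0, n, 2)
    ]


# Hierarchical variant: stride-4 embedding followed by three 2x poolings.
hierarchical = stage_grid_sizes(224)
# Plain ViT: a single stride-16 embedding, constant resolution throughout.
plain = 224 // 16

print("hierarchical grid sizes:", hierarchical)
print("plain ViT grid size:", plain)
print("pooled 2x2 example:", avg_pool_2x2([[1, 2], [3, 4]]))
```

Under these assumptions, the third stage of the hierarchical variant has the same grid size as the plain ViT (1/16 of the input), while the earlier stages provide the higher-resolution features that detection and segmentation heads typically consume.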

Comments:	Tech report, work in progress
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2211.01785 [cs.CV]
	(or arXiv:2211.01785v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2211.01785

Submission history

From: Yufei Xu
[v1] Thu, 3 Nov 2022 13:19:23 UTC (125 KB)
[v2] Tue, 8 Nov 2022 15:07:29 UTC (125 KB)
