PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers

Grainger, Ryan; Paniagua, Thomas; Song, Xi; Cuntoor, Naresh; Lee, Mun Wai; Wu, Tianfu

arXiv:2203.11987 (cs)

[Submitted on 22 Mar 2022 (v1), last revised 7 Apr 2023 (this version, v2)]

Abstract: Vision Transformers (ViTs) are built on the assumption that image patches can be treated as "visual tokens", and they learn patch-to-patch attention. The patch-embedding-based tokenizer has a semantic gap with respect to its counterpart, the textual tokenizer. Patch-to-patch attention suffers from quadratic complexity in the number of patches, and it also makes it non-trivial to explain learned ViTs. To address these issues, this paper proposes to learn Patch-to-Cluster attention (PaCa) in ViT. Queries in PaCa-ViT start with patches, while keys and values are directly based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and more interpretable models. The quadratic complexity is relaxed to linear complexity. The proposed PaCa module is used in designing efficient and interpretable ViT backbones and semantic segmentation head networks. In experiments, the proposed methods are tested on ImageNet-1k image classification, MS-COCO object detection and instance segmentation, and MIT-ADE20k semantic segmentation. Compared with prior art, PaCa-ViT outperforms Swin and PVT models on all three benchmarks, by significant margins on ImageNet-1k and MIT-ADE20k. It is also significantly more efficient than PVT models on MS-COCO and MIT-ADE20k due to the linear complexity. The learned clusters are semantically meaningful. Code and model checkpoints are available at this https URL.
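The patch-to-cluster idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single linear projection `W_c` used to produce cluster assignments, the single-head layout, the function name `patch_to_cluster_attention`, and the sizes `N`, `M`, `d` are all assumptions made for illustration. The sketch shows only the structural point: queries come from the N patches, keys and values come from M cluster tokens, so the attention matrix has N x M entries rather than N x N.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_to_cluster_attention(X, W_c, W_q, W_k, W_v):
    """Single-head patch-to-cluster attention (illustrative sketch).

    X:   (N, d) patch embeddings
    W_c: (d, M) hypothetical projection producing cluster-assignment logits
    W_q, W_k, W_v: (d, d) query/key/value projections
    """
    # Soft assignment of patches to M clusters, normalized over patches,
    # so each cluster token is a weighted average of patch embeddings.
    C = softmax(X @ W_c, axis=0)          # (N, M)
    Z = C.T @ X                           # (M, d) cluster tokens

    Q = X @ W_q                           # queries from patches:  (N, d)
    K = Z @ W_k                           # keys from clusters:    (M, d)
    V = Z @ W_v                           # values from clusters:  (M, d)

    # Attention matrix is N x M: linear in N for a fixed M.
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return A @ V, A

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
N, M, d = 196, 8, 16
X = rng.standard_normal((N, d))
W_c = rng.standard_normal((d, M))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

out, A = patch_to_cluster_attention(X, W_c, W_q, W_k, W_v)
print(out.shape, A.shape)                 # output is (N, d); attention is (N, M)
print("score entries:", A.size, "vs patch-to-patch:", N * N)
```

With M fixed, doubling the number of patches N doubles the size of the attention matrix instead of quadrupling it, which is the sense in which the cost is linear. Each row of `A` also indicates how strongly a patch attends to each cluster, which is one way the cluster-based formulation lends itself to inspection.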

Comments:	CVPR 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2203.11987 [cs.CV]
	(or arXiv:2203.11987v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2203.11987

Submission history

From: Ryan Grainger
[v1] Tue, 22 Mar 2022 18:28:02 UTC (16,422 KB)
[v2] Fri, 7 Apr 2023 00:46:43 UTC (33,720 KB)
