TCFormer: Visual Recognition via Token Clustering Transformer

Zeng, Wang; Jin, Sheng; Xu, Lumin; Liu, Wentao; Qian, Chen; Ouyang, Wanli; Luo, Ping; Wang, Xiaogang

arXiv:2407.11321 (cs)

[Submitted on 16 Jul 2024]

Abstract: Transformers are widely used in computer vision and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, a fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens. Through extensive experiments across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of TCFormer. The code and models for this work are available at this https URL.
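
The first characteristic above (one vision token for semantically similar regions, even non-adjacent ones) can be illustrated with a toy sketch. This is not the TCFormer implementation: the nearest-prototype assignment stands in for whatever clustering the paper uses, and the names `assign_clusters`, `merge_tokens`, and `prototypes` are hypothetical, introduced only for illustration. It also does not cover the second characteristic (finer tokens for detailed regions).

```python
from collections import defaultdict


def assign_clusters(features, prototypes):
    """Assign each grid token to its nearest prototype (squared Euclidean distance).

    Assumption: a nearest-prototype rule is used here only as a simple
    stand-in for a real clustering step.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(prototypes)), key=lambda k: sqdist(feat, prototypes[k]))
        for feat in features
    ]


def merge_tokens(features, cluster_ids):
    """Average the features of all grid tokens that share a cluster id."""
    groups = defaultdict(list)
    for feat, cid in zip(features, cluster_ids):
        groups[cid].append(feat)
    # One merged token per cluster: the element-wise mean of its members.
    return {
        cid: [sum(col) / len(col) for col in zip(*members)]
        for cid, members in groups.items()
    }


# A 2x2 grid in row-major order. Tokens 0 (top-left) and 3 (bottom-right)
# have similar features but are not edge-adjacent; likewise tokens 1 and 2.
grid_features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.5], [0.5, 0.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical semantic prototypes

cluster_ids = assign_clusters(grid_features, prototypes)
dynamic_tokens = merge_tokens(grid_features, cluster_ids)
```

In this toy example the four grid tokens collapse into two dynamic tokens, and each merged token draws on regions from opposite corners of the grid, which a fixed grid partition would keep separate.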

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.11321 [cs.CV]
	(or arXiv:2407.11321v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.11321

Submission history

From: Wang Zeng
[v1] Tue, 16 Jul 2024 02:26:18 UTC (21,525 KB)
