Compress image to patches for Vision Transformer

Zhao, Xinfeng; Sun, Yaoru

Computer Science > Computer Vision and Pattern Recognition

arXiv:2502.10120 (cs)

[Submitted on 14 Feb 2025 (v1), last revised 17 Feb 2025 (this version, v2)]

Abstract: The Vision Transformer (ViT) has made significant strides in computer vision. However, as model depth and input resolution increase, the computational cost of training and running ViT models rises sharply. This paper proposes CI2P-ViT, a hybrid model based on a CNN and the Vision Transformer. The model incorporates a module called CI2P, which uses the CompressAI encoder to compress images and then generates a sequence of patches through a series of convolutions. CI2P can replace the Patch Embedding component of a ViT, so it integrates seamlessly into existing ViT models. Compared to ViT-B/16, CI2P-ViT feeds the self-attention layers a quarter as many patches. This design not only significantly reduces the computational cost of the ViT model but also improves its accuracy by introducing the inductive bias of CNNs. When trained from scratch on the Animals-10 dataset, CI2P-ViT achieved an accuracy of 92.37%, a 3.3% improvement over the ViT-B/16 baseline. In addition, the model's computational cost, measured in floating-point operations (FLOPs), was reduced by 63.35%, and training was 2 times faster on identical hardware.
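The patch-count claim in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a 224x224 input (the usual ViT-B/16 setting, not stated in the abstract), ignores the class token, and uses hypothetical helper names `num_patches` and `attention_pair_count`.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches that tile a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def attention_pair_count(n_tokens: int) -> int:
    """Token pairs scored by one self-attention head: grows as n_tokens squared."""
    return n_tokens * n_tokens


# Assumption: 224x224 input; the abstract does not state the resolution.
IMAGE_SIZE = 224

baseline_patches = num_patches(IMAGE_SIZE, 16)  # ViT-B/16 patch grid: 14 x 14
reduced_patches = baseline_patches // 4         # "a quarter of the original"

# Ratio of attention-score work after the reduction, relative to the baseline.
pair_ratio = attention_pair_count(reduced_patches) / attention_pair_count(baseline_patches)

print(baseline_patches, reduced_patches, pair_ratio)
```

Under these assumptions, quartering the sequence length cuts the quadratic attention-score term to one sixteenth. The overall 63.35% FLOPs reduction reported in the abstract is a different quantity: a full model also contains layers whose cost scales linearly with the number of tokens, plus the cost of the CI2P module itself, so the two figures are not expected to match.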

Comments:	15 pages, 5 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.10120 [cs.CV]
	(or arXiv:2502.10120v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.10120

Submission history

From: Xinfeng Zhao
[v1] Fri, 14 Feb 2025 12:40:37 UTC (382 KB)
[v2] Mon, 17 Feb 2025 07:35:28 UTC (382 KB)
