When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

Allakhverdov, Eduard; Goncharova, Elizaveta; Kuznetsov, Andrey

arXiv:2503.16660 (cs)

[Submitted on 20 Mar 2025]

Abstract: Vision encoders typically generate a large number of visual tokens, providing information-rich representations but significantly increasing computational demands. This raises the question of whether all generated tokens are equally valuable or whether some of them can be discarded to reduce computational costs without compromising quality. In this paper, we introduce a new method for determining feature utility based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism that identifies and retains only the most informative visual tokens. To validate our approach, we compared the performance of the LLaVA-NeXT model when using features selected by our method against its performance when using randomly selected features. We found that on OCR-based tasks, more than 50% of the visual context can be removed with minimal performance loss, whereas randomly discarding the same proportion of features significantly degrades the model's capabilities. Furthermore, in general-domain tasks, even randomly retaining only 30% of tokens achieves performance comparable to using the full set of visual tokens. Our results highlight a promising direction towards adaptive and efficient multimodal pruning that facilitates scalable and low-overhead inference without compromising performance.
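The selection idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names (`gumbel_softmax_keep_probs`, `hard_mask`, `reconstruct_with_mean`), the per-token keep/drop logits, the 0.5 threshold, and the mean-of-kept-tokens "decoder" are all assumptions made for illustration. The paper uses a trained autoencoder; a trivial stand-in is used here only to show where reconstruction fits.

```python
import math
import random


def gumbel_softmax_keep_probs(logits, temperature=1.0, rng=None):
    """Relax a binary keep/drop choice per token with Gumbel-Softmax.

    logits: list of (keep_logit, drop_logit) pairs, one per token
            (hypothetical selector outputs).
    Returns one soft keep probability per token.
    """
    rng = rng or random

    def gumbel():
        # Gumbel(0, 1) sample: -log(-log(U)); clamp U away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        return -math.log(-math.log(u))

    probs = []
    for keep_logit, drop_logit in logits:
        a = (keep_logit + gumbel()) / temperature
        b = (drop_logit + gumbel()) / temperature
        m = max(a, b)  # subtract the max for numerical stability
        ea, eb = math.exp(a - m), math.exp(b - m)
        probs.append(ea / (ea + eb))
    return probs


def hard_mask(keep_probs, threshold=0.5):
    """Discretize soft keep probabilities into a boolean keep mask."""
    return [p > threshold for p in keep_probs]


def reconstruct_with_mean(tokens, mask):
    """Toy stand-in for a decoder: fill dropped tokens with the mean
    of the kept tokens (zeros if nothing was kept)."""
    dim = len(tokens[0])
    kept = [t for t, keep in zip(tokens, mask) if keep]
    if kept:
        mean = [sum(t[i] for t in kept) / len(kept) for i in range(dim)]
    else:
        mean = [0.0] * dim
    return [list(t) if keep else list(mean) for t, keep in zip(tokens, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [[1.0, 2.0], [9.0, 9.0], [3.0, 4.0]]
    logits = [(2.0, 0.0), (0.0, 2.0), (2.0, 0.0)]
    probs = gumbel_softmax_keep_probs(logits, temperature=0.5, rng=rng)
    mask = hard_mask(probs)
    print("keep mask:", mask)
    print("reconstruction:", reconstruct_with_mean(tokens, mask))
```

In a trained system the reconstruction error on dropped tokens would drive the selector toward keeping tokens that cannot be recovered from the rest; lowering the temperature makes the soft probabilities approach a hard keep/drop decision.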

Comments:	10 pages, 8 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
MSC classes:	68T10, 68T30, 68T45
ACM classes:	I.2.10
Cite as:	arXiv:2503.16660 [cs.CV]
	(or arXiv:2503.16660v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.16660

Submission history

From: Eduard Allakhverdov
[v1] Thu, 20 Mar 2025 19:17:08 UTC (5,888 KB)
