TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models

Tan, Xudong; Ye, Peng; Tu, Chongjun; Cao, Jianjian; Yang, Yaoxin; Zhang, Lin; Zhou, Dongzhan; Chen, Tao

arXiv:2503.10501 (cs)

[Submitted on 13 Mar 2025]

Abstract: Multimodal Large Language Models (MLLMs) are becoming increasingly popular, but the high computational cost of multimodal input, particularly from visual tokens, poses a significant challenge. Existing training-based token compression methods improve inference efficiency but require costly retraining, while training-free methods struggle to maintain performance when token counts are reduced aggressively. In this study, we reveal that the performance degradation of MLLMs closely correlates with the accelerated loss of information in the attention output matrix. This insight motivates a novel information-preserving perspective, making it possible to maintain performance even under extreme token compression. Based on this finding, we propose TokenCarve, a training-free, plug-and-play, two-stage token compression framework. The first stage employs an Information-Preservation-Guided Selection (IPGS) strategy to prune low-information tokens, while the second stage further leverages IPGS to guide token merging, minimizing information loss. Extensive experiments on 11 datasets and 2 model variants demonstrate the effectiveness of TokenCarve. It can reduce the number of visual tokens to 22.2% of the original count while achieving a 1.23x speedup in inference and a 64% reduction in KV cache storage, with only a 1.54% drop in accuracy. Our code is available at this https URL.
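
The prune-then-merge structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how IPGS computes its scores, so the per-token `scores` below are simply taken as input, and the merging rule (cosine similarity to the highest-scoring tokens, followed by averaging) is an assumption chosen for illustration. The function name `compress_tokens` and its parameters are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compress_tokens(tokens, scores, n_prune_keep, n_final):
    """Two-stage sketch: prune to n_prune_keep tokens, then merge to n_final.

    `scores` stands in for an information score per token (assumed given;
    the real IPGS criterion is not described in the abstract).
    """
    if not (0 < n_final <= n_prune_keep <= len(tokens)):
        raise ValueError("need 0 < n_final <= n_prune_keep <= len(tokens)")

    # Stage 1 (pruning): keep the highest-scoring tokens, drop the rest.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:n_prune_keep])

    # Stage 2 (merging): the n_final best tokens act as anchors; every other
    # kept token is averaged into its most similar anchor (assumed rule).
    anchors = sorted(order[:n_final])
    anchor_set = set(anchors)
    groups = {a: [tokens[a]] for a in anchors}
    for i in kept:
        if i in anchor_set:
            continue
        best = max(anchors, key=lambda a: cosine(tokens[i], tokens[a]))
        groups[best].append(tokens[i])

    merged = []
    for a in anchors:
        group = groups[a]
        dim = len(group[0])
        merged.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return merged


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [5.0, 5.0]]
scores = [0.9, 0.5, 0.8, 0.4, 0.1]
compressed = compress_tokens(tokens, scores, n_prune_keep=4, n_final=2)
print(len(compressed))
```

In this toy input, the lowest-scoring token is discarded in the first stage, and the two remaining non-anchor tokens are folded into their nearest anchors in the second stage, so five tokens become two.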

Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2503.10501 [cs.CV] (or arXiv:2503.10501v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2503.10501

Submission history

From: Xudong Tan
[v1] Thu, 13 Mar 2025 16:04:31 UTC (7,441 KB)
