Token Sequence Compression for Efficient Multimodal Computing

Omri, Yasmine; Shroff, Parth; Tambe, Thierry

arXiv:2504.17892 (cs)

[Submitted on 24 Apr 2025]

Abstract: The exponential growth of Large Multimodal Models (LMMs) has driven advances in cross-modal reasoning, but at significant computational cost. In this work, we focus on visual language models. We highlight the redundancy and inefficiency in current vision encoders and seek to construct an adaptive compression method for multimodal data. We characterize a range of visual token selection and merging approaches through both benchmarking and qualitative analysis. In particular, we demonstrate that simple cluster-level token aggregation outperforms prior state-of-the-art methods for token selection and merging, including merging at the vision encoder level and attention-based approaches. Using cross-modal attention visualizations, we also shed light on several puzzling trends regarding principles of visual token selection. This work is a first effort towards more effective encoding and processing of high-dimensional data, and paves the way for more scalable and sustainable multimodal systems.
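
One way to picture the cluster-level token aggregation mentioned in the abstract is the minimal sketch below. It is not the authors' implementation: the abstract does not name a clustering algorithm, so plain k-means followed by per-cluster mean pooling is assumed here. The function name `cluster_aggregate`, its parameters, and the 576-token, 32-dimensional input are all hypothetical and chosen only for illustration.

```python
import numpy as np


def cluster_aggregate(tokens, num_clusters, num_iters=10, seed=0):
    """Compress an (N, D) array of visual tokens to k = min(num_clusters, N) tokens.

    Assumption: k-means clustering with mean pooling per cluster. The paper's
    abstract only says "cluster-level token aggregation", so this is a stand-in.
    """
    tokens = np.asarray(tokens, dtype=float)
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    k = min(num_clusters, n)

    # Initialise centroids from k distinct tokens.
    centroids = tokens[rng.choice(n, size=k, replace=False)].copy()
    assign = np.zeros(n, dtype=int)

    for _ in range(num_iters):
        # Squared Euclidean distance from every token to every centroid: (N, k).
        dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        # Aggregate: each centroid becomes the mean of its member tokens.
        # An empty cluster keeps its previous centroid.
        for c in range(k):
            members = tokens[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)

    return centroids, assign


# Hypothetical input: 576 patch tokens of dimension 32, compressed to 64 tokens.
patch_tokens = np.random.default_rng(1).standard_normal((576, 32))
compressed, assignment = cluster_aggregate(patch_tokens, num_clusters=64)
print(compressed.shape, assignment.shape)
```

In this sketch the language model would receive the 64 aggregated tokens in place of the original 576, so the sequence length seen downstream shrinks by the chosen ratio. Whether the aggregation runs inside or after the vision encoder is a design choice the abstract leaves open.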

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2504.17892 [cs.CV]
	(or arXiv:2504.17892v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.17892

Submission history

From: Yasmine Omri
[v1] Thu, 24 Apr 2025 19:11:10 UTC (5,836 KB)
