Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Liu, Zhihang; Xie, Chen-Wei; Li, Pandeng; Zhao, Liming; Tang, Longxiang; Zheng, Yun; Liu, Chuanbin; Xie, Hongtao

arXiv:2503.16036 (cs)

[Submitted on 20 Mar 2025]

Abstract: Recent Multi-modal Large Language Models (MLLMs) are challenged by the computational overhead of processing massive numbers of video frames, which is often alleviated through compression strategies. However, visual content does not contribute equally to user instructions, so existing strategies (e.g., average pooling) inevitably lose potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), which uses the instruction as a condition to guide compression at both the local and global levels. This encourages the compression to retain as much user-focused information as possible while reducing the number of visual tokens to minimize the computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and into the learnable tokens at the global level, and an attention mechanism then performs the conditional compression. With this hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is preserved, making the result easier for LLMs to understand. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset, HICom-248K. Experiments show that HICom achieves distinguished video understanding ability with fewer tokens, improving performance by 2.43% on average across three multiple-choice QA benchmarks while saving 78.8% of tokens compared with the SOTA method. The code is available at this https URL.
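
The hybrid-level idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the instruction is injected, how tokens are grouped, or the attention configuration. The sketch assumes injection by vector addition, groups of consecutive tokens, single-head dot-product attention, and fixed stand-ins for the learnable global tokens. All function and parameter names (`compress`, `attend`, `group_size`, `global_queries`) are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def attend(query, tokens):
    """Single-head dot-product attention: one query over a list of tokens."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    return [sum(w * t[i] for w, t in zip(weights, tokens))
            for i in range(len(query))]

def compress(visual_tokens, instruction, global_queries, group_size):
    # Local level: one output token per group of consecutive visual tokens.
    # The group mean (what average pooling would return) is used as the
    # query after the instruction is added to it (assumed injection).
    local = []
    for start in range(0, len(visual_tokens), group_size):
        group = visual_tokens[start:start + group_size]
        query = add(mean(group), instruction)
        local.append(attend(query, group))
    # Global level: each query (learnable in a real model, fixed here)
    # receives the instruction and attends over all visual tokens.
    global_out = [attend(add(q, instruction), visual_tokens)
                  for q in global_queries]
    return local + global_out

# Toy input: 4 visual tokens of dimension 2; the instruction favors dim 0.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
instr = [5.0, 0.0]
out = compress(visual, instr, global_queries=[[0.0, 0.0]], group_size=2)
print(len(out))  # 2 local tokens + 1 global token
```

In this toy setting, the first local token is pulled toward the visual token aligned with the instruction, whereas plain average pooling of the same group would weight both tokens equally. The local outputs keep their group order, which loosely mirrors the abstract's point about preserving temporal-spatial structure.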

Comments:	Accepted to CVPR2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2503.16036 [cs.CV]
	(or arXiv:2503.16036v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.16036

Submission history

From: Zhihang Liu
[v1] Thu, 20 Mar 2025 11:09:18 UTC (2,724 KB)
