FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance

Wang, Haicheng; Yu, Zhemeng; Spadaro, Gabriele; Ju, Chen; Quétu, Victor; Tartaglione, Enzo

arXiv:2501.02430 (cs)

[Submitted on 5 Jan 2025]

Abstract: Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable effectiveness on multi-modal tasks due to their ability to generate and understand cross-modal data. However, processing the long sequences of visual tokens extracted from visual backbones poses a challenge for deployment in real-time applications. To address this issue, we introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence, mitigating both computational and memory demands during training and inference. Through a comprehensive analysis of the token reduction process, we characterize the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. We showcase the effectiveness of FOLDER by integrating it into the visual backbone of several MLLMs, significantly accelerating the inference phase. Furthermore, we evaluate its utility as a training accelerator or even performance booster for MLLMs. In both contexts, FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.
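
To give a rough sense of what removing up to 70% of visual tokens can mean for compute, the sketch below works through the arithmetic in Python. It is a back-of-the-envelope model, not FOLDER's algorithm: the function names are hypothetical, the token count of 576 is an illustrative assumption rather than a figure from the abstract, and the model assumes that per-token work scales linearly with sequence length while pairwise self-attention among visual tokens scales quadratically. Text tokens are ignored.

```python
import math


def kept_tokens(n_visual: int, removal_ratio: float) -> int:
    """Number of visual tokens left after dropping a fraction of them."""
    if not 0.0 <= removal_ratio < 1.0:
        raise ValueError("removal_ratio must be in [0, 1)")
    return n_visual - math.floor(n_visual * removal_ratio)


def relative_costs(n_visual: int, removal_ratio: float) -> dict:
    """Cost after reduction relative to the full sequence (1.0 = unchanged).

    Simplifying assumptions (not from the paper): per-token work such as
    MLP layers is linear in sequence length, pairwise self-attention among
    visual tokens is quadratic, and text tokens are ignored.
    """
    frac = kept_tokens(n_visual, removal_ratio) / n_visual
    return {"linear": frac, "quadratic": frac ** 2}


if __name__ == "__main__":
    n = 576  # illustrative visual token count, not taken from the abstract
    print("tokens kept:", kept_tokens(n, 0.7))
    print("relative costs:", relative_costs(n, 0.7))
```

Under these assumptions, a 70% removal leaves roughly 30% of the linear cost and roughly 9% of the quadratic attention cost over visual tokens. Actual speedups depend on where in the model the reduction is applied and on how many text tokens share the sequence.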

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.02430 [cs.CV]
	(or arXiv:2501.02430v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.02430

Submission history

From: Chen Ju
[v1] Sun, 5 Jan 2025 03:28:45 UTC (2,857 KB)
