BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Liu, Shuming; Zhao, Chen; Xu, Tianqi; Ghanem, Bernard

arXiv:2503.21483 (cs)

[Submitted on 27 Mar 2025]

Abstract: Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is constrained by limited context windows. Traditional approaches, such as uniform frame sampling, often allocate resources to irrelevant content, diminishing their effectiveness in real-world scenarios. In this paper, we introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies. First, to enable a more realistic evaluation of VLMs in long-form video understanding, we propose a multi-source retrieval evaluation setting. Our findings reveal that uniform sampling performs poorly in noisy contexts, underscoring the importance of selecting the right frames. Second, we explore several frame selection strategies based on query-frame similarity and analyze their effectiveness at inference time. Our results show that inverse transform sampling yields the most significant performance improvement, increasing accuracy on the Video-MME benchmark from 53.8% to 56.1% and on the MLVU benchmark from 58.9% to 63.4%. Our code is available at this https URL.
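To make the idea of inverse transform sampling over query-frame similarity concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a per-frame similarity score to the query has already been computed by some encoder; the function name `inverse_transform_select`, the min-shift normalization, and the choice of bin-midpoint quantiles are all illustrative assumptions. The scores are turned into a cumulative distribution, and frames are picked where evenly spaced quantiles land, so high-similarity regions receive more frames than uniform sampling would give them.

```python
from bisect import bisect_left
from itertools import accumulate


def inverse_transform_select(scores, num_frames):
    """Pick frame indices at evenly spaced quantiles of the score CDF.

    `scores` is a list of query-frame similarity values (assumed precomputed).
    Returns a sorted list of at most `num_frames` distinct frame indices.
    """
    if not scores or num_frames <= 0:
        return []

    # Shift scores so the least similar frame has weight zero (an assumption;
    # other normalizations are possible).
    lowest = min(scores)
    weights = [s - lowest for s in scores]
    total = sum(weights)
    if total == 0:
        # All frames are equally similar: fall back to uniform weights.
        weights = [1.0] * len(scores)
        total = float(len(scores))

    # Cumulative distribution over frames.
    cdf = list(accumulate(w / total for w in weights))

    # Midpoints of num_frames equal-probability bins.
    quantiles = [(k + 0.5) / num_frames for k in range(num_frames)]

    # Invert the CDF: first frame whose cumulative mass reaches the quantile.
    last = len(scores) - 1
    picked = {min(bisect_left(cdf, q), last) for q in quantiles}
    return sorted(picked)


if __name__ == "__main__":
    # Two frames in the middle match the query far better than the rest.
    print(inverse_transform_select([0.1, 0.1, 0.9, 0.9, 0.1, 0.1], 2))
```

In this toy input, both selected frames fall in the high-similarity region, whereas sampling two frames uniformly from six would likely miss part of it. Because several quantiles can map to the same frame, the function may return fewer than `num_frames` indices; how to handle such duplicates is a design choice this sketch leaves open.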

Comments:	Accepted to CVPR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.21483 [cs.CV]
	(or arXiv:2503.21483v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.21483

Submission history

From: Shuming Liu
[v1] Thu, 27 Mar 2025 13:18:40 UTC (1,473 KB)
