M-LLM Based Video Frame Selection for Efficient Video Understanding

Hu, Kai; Gao, Feng; Nie, Xiaohan; Zhou, Peng; Tran, Son; Neiman, Tal; Wang, Lingyun; Shah, Mubarak; Hamid, Raffay; Yin, Bing; Chilimbi, Trishul

Computer Science > Computer Vision and Pattern Recognition

arXiv:2502.19680 (cs)

[Submitted on 27 Feb 2025 (v1), last revised 26 Mar 2025 (this version, v2)]


Abstract: Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames that are fed into an M-LLM, particularly for long-context videos. However, uniform sampling can lose crucial context in certain periods of a video, so the downstream M-LLM may not have sufficient visual information to answer a question. To address this pain point, we propose a lightweight M-LLM-based frame selection method that adaptively selects frames that are more relevant to users' queries. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, in which a single-frame importance score is obtained by prompting an M-LLM; and (ii) a temporal signal, in which a multi-frame selection is obtained by prompting a Large Language Model (LLM) with the captions of all frame candidates. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video Large Language Models (video-LLMs) across medium-context (ActivityNet, NExT-QA) and long-context (EgoSchema, LongVideoBench) video question answering benchmarks.
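
The contrast the abstract draws between uniform sampling and query-aware selection can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `uniform_sample` and `select_top_k` are hypothetical, and the per-frame relevance scores are assumed to come from some external scorer (in the paper, a trained M-LLM-based selector). Here they are just hand-written numbers.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick k frame indices evenly spaced over the video (query-agnostic)."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    # Take the middle frame of each of the k equal-length segments.
    return [int(step * i + step / 2) for i in range(k)]


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Pick the k highest-scoring frames, returned in temporal order.

    `scores` is a hypothetical per-frame relevance score for one query;
    how such scores are produced is outside the scope of this sketch.
    """
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Illustrative scores: the frames relevant to the query cluster near the end.
scores = [0.0, 0.1, 0.0, 0.1, 0.9, 0.8, 0.95, 0.1]
uniform_idx = uniform_sample(len(scores), 2)
selected_idx = select_top_k(scores, 2)
```

With these made-up scores, uniform sampling spends one of its two frames on a low-relevance region, while score-based selection keeps both frames inside the relevant span. Returning the indices in temporal order is a design choice made here so that a downstream model would still see frames in their original sequence.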

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.19680 [cs.CV]
	(or arXiv:2502.19680v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.19680

Submission history

From: Feng Gao
[v1] Thu, 27 Feb 2025 01:44:13 UTC (4,059 KB)
[v2] Wed, 26 Mar 2025 21:14:41 UTC (4,058 KB)
