CoS: Chain-of-Shot Prompting for Long Video Understanding

Hu, Jian; Cheng, Zixu; Si, Chenyang; Li, Wei; Gong, Shaogang

arXiv:2502.06428 (cs)

[Submitted on 10 Feb 2025 (v1), last revised 11 Feb 2025 (this version, v2)]

Abstract: Multi-modal Large Language Models (MLLMs) struggle with long videos because they require an excessive number of visual tokens. These tokens massively exceed the context length of MLLMs, so the context ends up filled with redundant, task-irrelevant shots. How to select shots remains a critical unsolved problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adapted to the semantic task of video understanding by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding that identifies task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimise long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is available at this https URL.
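
To make the idea of a binary coding over shots concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes each shot already has a task-relevance score in [0, 1]; the abstract does not say how such scores are obtained, so the scores, the 0.5 threshold, and the function names binary_coding and split_shots are all hypothetical. The sketch only shows how a binary code can partition shots into task-relevant positives and irrelevant negatives.

```python
from typing import List, Sequence, Tuple


def binary_coding(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map per-shot relevance scores to a 0/1 code.

    Assumption: scores are given; thresholding is a stand-in for
    whatever mechanism the paper actually uses.
    """
    return [1 if s >= threshold else 0 for s in scores]


def split_shots(
    shots: Sequence[str], code: Sequence[int]
) -> Tuple[List[str], List[str]]:
    """Partition shots into (positive, negative) lists using the binary code."""
    if len(shots) != len(code):
        raise ValueError("shots and code must have the same length")
    positive = [s for s, c in zip(shots, code) if c == 1]
    negative = [s for s, c in zip(shots, code) if c == 0]
    return positive, negative


if __name__ == "__main__":
    # Hypothetical shot identifiers and relevance scores.
    shots = ["shot_0", "shot_1", "shot_2", "shot_3"]
    scores = [0.1, 0.8, 0.4, 0.9]

    code = binary_coding(scores)
    positive, negative = split_shots(shots, code)
    print("code:", code)
    print("positive:", positive)
    print("negative:", negative)
```

In this sketch the positive and negative lists correspond to the two groups that the co-reasoning module is described as pairing; how the pairing informs the final answer is not specified in the abstract and is left out here.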

Comments:	A training-free test-time optimisation approach for long video understanding
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.06428 [cs.CV]
	(or arXiv:2502.06428v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.06428

Submission history

From: Jian Hu
[v1] Mon, 10 Feb 2025 13:03:05 UTC (12,879 KB)
[v2] Tue, 11 Feb 2025 14:59:25 UTC (12,879 KB)
