Re-thinking Temporal Search for Long-Form Video Understanding

Ye, Jinhui; Wang, Zihan; Sun, Haosen; Chandrasegaran, Keshigeyan; Durante, Zane; Eyzaguirre, Cristobal; Bisk, Yonatan; Niebles, Juan Carlos; Adeli, Ehsan; Fei-Fei, Li; Wu, Jiajun; Li, Manling

arXiv:2504.02259 (cs)

[Submitted on 3 Apr 2025 (v1), last revised 6 Apr 2025 (this version, v2)]

Abstract: Efficiently understanding long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and address a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). Our contributions are twofold. First, we frame temporal search as a Long Video Haystack problem: finding a minimal set of relevant frames (e.g., one to five) from tens of thousands based on specific queries. Building on this formulation, we introduce LV-Haystack, the first dataset with 480 hours of video and 15,092 human-annotated instances for both training and evaluation, aimed at improving temporal search quality and efficiency. Results on LV-Haystack highlight a significant research gap in temporal search capabilities, with current SOTA search methods achieving only a 2.1% temporal F1 score on the LongVideoBench subset. Second, inspired by visual search in images, we propose a lightweight temporal search framework, T*, that reframes costly temporal search as spatial search. T* leverages powerful visual localization techniques commonly used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Extensive experiments show that integrating T* with existing methods significantly improves SOTA long-form video understanding. Under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-OV-72B's performance from 56.5% to 62.4% on the LongVideoBench XL subset. Our code, benchmark, and models are provided in the Supplementary material.
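To make the Long Video Haystack framing concrete, the following is a minimal sketch of how a set of selected frames might be scored against human-annotated reference frames. The abstract does not define the temporal F1 metric, so the tolerance-based matching rule here is an assumption, and the names `temporal_f1`, `uniform_sample`, and `tolerance` are hypothetical rather than taken from the paper or its code.

```python
from typing import List, Sequence


def temporal_f1(predicted: Sequence[int], reference: Sequence[int], tolerance: int = 0) -> float:
    """F1 between predicted and reference frame indices.

    Assumed matching rule (not from the paper): a predicted frame is correct
    if it lies within `tolerance` frames of any reference frame, and a
    reference frame is found if any predicted frame lies within `tolerance`
    frames of it.
    """
    if not predicted or not reference:
        return 0.0
    # Precision: share of predicted frames that land near a reference frame.
    correct = sum(any(abs(p - r) <= tolerance for r in reference) for p in predicted)
    # Recall: share of reference frames that have a nearby predicted frame.
    found = sum(any(abs(p - r) <= tolerance for p in predicted) for r in reference)
    precision = correct / len(predicted)
    recall = found / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def uniform_sample(num_frames: int, budget: int) -> List[int]:
    """Baseline selector: `budget` evenly spaced frame indices."""
    step = num_frames / budget
    return [int(step * i + step / 2) for i in range(budget)]


# Hypothetical example: a 36,000-frame video with three annotated frames.
reference_frames = [1200, 1210, 25000]
baseline = uniform_sample(36000, 32)
print(temporal_f1(baseline, reference_frames, tolerance=30))        # 0.0
print(temporal_f1([1205, 25010], reference_frames, tolerance=10))   # 1.0
```

In this toy setup, 32 uniformly spaced frames miss every annotated frame, while two well-placed frames cover all three, which illustrates why frame selection matters under a fixed inference budget. The numbers are illustrative only and are not results from LV-Haystack.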

Comments:	Accepted by CVPR 2025; A real-world long video needle-in-haystack benchmark; long-video QA with human ref frames
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.02259 [cs.CV]
	(or arXiv:2504.02259v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.02259

Submission history

From: Jinhui Ye
[v1] Thu, 3 Apr 2025 04:03:10 UTC (4,890 KB)
[v2] Sun, 6 Apr 2025 14:10:42 UTC (4,890 KB)
