FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Plou, Carlos; Borja, Cesar; Martinez-Cantin, Ruben; Murillo, Ana C.

arXiv:2503.19850 (cs)

[Submitted on 25 Mar 2025]

Abstract: Information retrieval in hour-long videos presents a significant challenge, even for state-of-the-art Vision-Language Models (VLMs), particularly when the desired information is localized within a small subset of frames. Long videos are difficult for VLMs because of context window limitations and the difficulty of pinpointing the frames that contain the answer. Our novel video agent, FALCONEye, combines a VLM and a Large Language Model (LLM) to search for relevant information along the video and locate the frames containing the answer. FALCONEye's novelty lies in 1) the proposed meta-architecture, which is better suited to hour-long videos than state-of-the-art short-video approaches; 2) a new efficient exploration algorithm that locates the information using short clips, captions, and answer confidence; and 3) our calibration analysis of state-of-the-art VLMs for the answer confidence. Our agent is built on a small-size VLM and a medium-size LLM, making it accessible to run on standard computational resources. We also release FALCON-Bench, a benchmark to evaluate long (average > 1 hour) Video Answer Search challenges, highlighting the need for open-ended question evaluation. Our experiments show that FALCONEye outperforms the state of the art on FALCON-Bench and achieves similar or better performance on related benchmarks.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.19850 [cs.CV]
	(or arXiv:2503.19850v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.19850

Submission history

From: Carlos Plou
[v1] Tue, 25 Mar 2025 17:17:19 UTC (44,051 KB)
