VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Wang, Xiaohan; Zhang, Yuhui; Zohar, Orr; Yeung-Levy, Serena

Computer Science > Computer Vision and Pattern Recognition

arXiv:2403.10517 (cs)

[Submitted on 15 Mar 2024]

Abstract: Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy, respectively, using only 8.4 and 8.2 frames on average. These results demonstrate the superior effectiveness and efficiency of our method over current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
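
The iterative loop described in the abstract (a language model acting as the agent, with vision-language models as tools that translate frames into text and retrieve further frames) might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `run_agent`, `caption_frame`, `propose_answer`, and `retrieve_frames` are hypothetical, and the uniform initial sampling, the confidence flag, and the round limit are assumptions made here for the sake of a runnable example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentState:
    """Information the agent has compiled so far (hypothetical structure)."""
    question: str
    captions: Dict[int, str] = field(default_factory=dict)  # frame index -> text


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick up to k roughly evenly spaced frame indices in [0, num_frames)."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    return [int(i * step) for i in range(k)]


def run_agent(
    question: str,
    num_frames: int,
    caption_frame: Callable[[int], str],
    propose_answer: Callable[[AgentState], Tuple[str, bool, str]],
    retrieve_frames: Callable[[str, int], List[int]],
    initial_frames: int = 5,   # assumption: size of the first glance
    max_rounds: int = 3,       # assumption: cap on agent iterations
) -> Tuple[str, List[int]]:
    """Answer a question while looking at as few frames as possible.

    caption_frame   stands in for a vision-language model that translates
                    a frame into text.
    propose_answer  stands in for the language model; it returns
                    (answer, is_confident, description_of_missing_info).
    retrieve_frames stands in for a retrieval model that maps a text query
                    to frame indices.
    """
    state = AgentState(question)
    for idx in uniform_sample(num_frames, initial_frames):
        state.captions[idx] = caption_frame(idx)

    answer = ""
    for _ in range(max_rounds):
        answer, confident, missing = propose_answer(state)
        if confident:
            break
        # Ask the retrieval tool for frames that may contain the missing info,
        # skipping frames that were already captioned.
        new = [i for i in retrieve_frames(missing, num_frames)
               if i not in state.captions]
        if not new:
            break  # nothing new to look at; keep the current best answer
        for idx in new:
            state.captions[idx] = caption_frame(idx)

    # Return the last proposed answer and the frames that were inspected.
    return answer, sorted(state.captions)
```

In this sketch the three callables would be backed by real models; the second return value lists the frames actually inspected, which is the quantity behind a "frames used on average" figure.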

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:	arXiv:2403.10517 [cs.CV]
	(or arXiv:2403.10517v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.10517

Submission history

From: Xiaohan Wang
[v1] Fri, 15 Mar 2024 17:57:52 UTC (1,761 KB)
