Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

Fei, Hao; Wu, Shengqiong; Ji, Wei; Zhang, Hanwang; Zhang, Meishan; Lee, Mong-Li; Hsu, Wynne

arXiv:2501.03230 (cs)

[Submitted on 7 May 2024]

Abstract: Existing research on video understanding still struggles to achieve in-depth comprehension and reasoning over complex videos, primarily because two key bottlenecks remain under-explored: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper addresses both. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating a video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the core of Chain-of-Thought (CoT): it breaks a complex task down into simpler, manageable sub-problems and addresses them step by step, from low-level pixel perception to high-level cognitive interpretation. Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework substantially improves on the existing state of the art. To our knowledge, this is the first successful attempt at implementing the CoT technique for human-level video reasoning, and we show great potential in extending it to a wider range of video understanding scenarios. The project is open at this https URL
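
The step-by-step decomposition described in the abstract can be illustrated with a generic prompt chain. This is a minimal sketch and not the authors' implementation: the abstract does not list the actual VoT steps, so the step names, prompt templates, `run_chain`, and the `echo_model` stand-in for a video MLLM are all hypothetical. It only shows the general pattern in which each step sees the outputs of the earlier ones.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One reasoning stage. Names and templates are illustrative assumptions."""
    name: str
    prompt_template: str  # may reference {question} and {context}


@dataclass
class Trace:
    """Collects the labelled output of every step for one question."""
    question: str
    outputs: List[str] = field(default_factory=list)


def run_chain(question: str, steps: List[Step],
              model: Callable[[str], str]) -> Trace:
    """Run the steps in order, passing all earlier outputs as context."""
    trace = Trace(question)
    for step in steps:
        context = "\n".join(trace.outputs)
        prompt = step.prompt_template.format(question=question, context=context)
        trace.outputs.append(f"[{step.name}] {model(prompt)}")
    return trace


def echo_model(prompt: str) -> str:
    """Stand-in for a video MLLM; it only reports the prompt length."""
    return f"answered a {len(prompt)}-character prompt"


# Hypothetical three-stage chain, ordered from perception to cognition.
STEPS = [
    Step("perception", "Locate the objects relevant to: {question}"),
    Step("interpretation",
         "Given:\n{context}\nExplain what is happening for: {question}"),
    Step("answer", "Given:\n{context}\nAnswer: {question}"),
]

if __name__ == "__main__":
    for line in run_chain("Why did the car stop?", STEPS, echo_model).outputs:
        print(line)
```

In a real system `model` would wrap a call to a video-capable model and the prompts would carry video features or grounding results; here it is a plain function so the chaining logic can be run and inspected on its own.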

Comments:	Accepted by ICML 2024
Subjects:	Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.03230 [cs.AI]
	(or arXiv:2501.03230v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2501.03230

Submission history

From: Meishan Zhang
[v1] Tue, 7 May 2024 11:55:10 UTC (1,130 KB)
