MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Hong, Wenyi; Cheng, Yean; Yang, Zhuoyi; Wang, Weihan; Wang, Lefan; Gu, Xiaotao; Huang, Shiyu; Dong, Yuxiao; Tang, Jie

Computer Science > Computer Vision and Pattern Recognition

arXiv:2501.02955 (cs)

[Submitted on 6 Jan 2025]

Abstract: In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of an LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: this https URL.

Comments:	20 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.02955 [cs.CV]
	(or arXiv:2501.02955v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.02955

Submission history

From: Wenyi Hong
[v1] Mon, 6 Jan 2025 11:57:38 UTC (8,699 KB)
