V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

Zhao, Yiming; Zeng, Yu; Qi, Yukun; Liu, YaoYang; Chen, Lin; Chen, Zehui; Bao, Xikun; Zhao, Jie; Zhao, Feng

arXiv:2503.17736 (cs)

[Submitted on 22 Mar 2025]

Abstract: Large Vision-Language Models (LVLMs) have recently made significant progress in video understanding. However, current benchmarks rely uniformly on text prompts for evaluation, which often require complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address it, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: this https URL.
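
To make the size of the reported gap concrete, the following is a minimal sketch that recomputes it from the figures quoted in the abstract. Only the three accuracy values and the video and QA-pair counts come from the abstract; the names `reported_accuracy`, `human_accuracy`, and `gap_to_human` are hypothetical helpers introduced for illustration and are not part of any V2P-Bench tooling.

```python
# Figures quoted in the abstract (accuracy on V2P-Bench, in percent).
reported_accuracy = {
    "GPT-4o": 65.4,
    "Gemini-1.5-Pro": 67.9,
}
human_accuracy = 88.3  # human experts, as quoted in the abstract

# Dataset size quoted in the abstract.
num_videos = 980
num_qa_pairs = 1172


def gap_to_human(model_acc: float, human_acc: float) -> float:
    """Return the absolute gap to human accuracy in percentage points."""
    return round(human_acc - model_acc, 1)


# Hypothetical helper output: gap per model, in percentage points.
gaps = {name: gap_to_human(acc, human_accuracy)
        for name, acc in reported_accuracy.items()}

# Average number of QA pairs per video, rounded to two decimals.
qa_per_video = round(num_qa_pairs / num_videos, 2)

for name, gap in gaps.items():
    print(f"{name}: {gap} points below human experts")
print(f"QA pairs per video: {qa_per_video}")
```

Under these figures, both models trail the human experts by roughly 20 percentage points or more, and the benchmark averages a little over one QA pair per video.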

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2503.17736 [cs.CV] (or arXiv:2503.17736v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2503.17736

Submission history

From: Yiming Zhao
[v1] Sat, 22 Mar 2025 11:30:46 UTC (20,151 KB)