Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Chen, Yi; Ge, Yuying; Wang, Rui; Ge, Yixiao; Qiu, Lu; Shan, Ying; Liu, Xihui

arXiv:2503.24376 (cs)

[Submitted on 31 Mar 2025]

Abstract: Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.

Comments:	Technical Report (In Progress); Code released at: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2503.24376 [cs.CV]
	(or arXiv:2503.24376v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.24376

Submission history

From: Yi Chen
[v1] Mon, 31 Mar 2025 17:55:23 UTC (3,436 KB)
