Improved Visual-Spatial Reasoning via R1-Zero-Like Training

Liao, Zhenyi; Xie, Qingsong; Zhang, Yanhao; Kong, Zijian; Lu, Haonan; Yang, Zhenyu; Deng, Zhijie

Computer Science > Computer Vision and Pattern Recognition

arXiv:2504.00883 (cs)

[Submitted on 1 Apr 2025]

Abstract: Increasing attention has been placed on improving the reasoning capacities of multi-modal large language models (MLLMs). As the cornerstone for AI agents that function in the physical realm, video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of MLLMs. This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training. Technically, we first identify that the visual-spatial reasoning capacities of small- to medium-sized Qwen2-VL models cannot be activated via Chain of Thought (CoT) prompts. We then incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated VSI-100k dataset, following DeepSeek-R1-Zero. During the investigation, we identify the necessity to keep the KL penalty (even with a small value) in GRPO. With just 120 GPU hours, our vsGRPO-2B model, fine-tuned from Qwen2-VL-2B, can outperform the base model by 12.1% and surpass GPT-4o. Moreover, our vsGRPO-7B model, fine-tuned from Qwen2-VL-7B, achieves performance comparable to that of the best open-source model LLaVA-NeXT-Video-72B. Additionally, we compare vsGRPO to supervised fine-tuning and direct preference optimization baselines and observe strong performance superiority. The code and dataset will be available soon.
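For readers unfamiliar with the two GRPO ingredients the abstract mentions (group-relative rewards and a small KL penalty), the following is a minimal sketch of the generic formulation, not the authors' implementation. It assumes scalar rewards for a group of sampled responses, sequence-level log-probabilities, population standard deviation for normalization, and a hypothetical coefficient `beta`; the abstract does not state the value used. The policy-ratio clipping of full GRPO is omitted for brevity, and all function names are illustrative.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalty(logp_policy, logp_ref):
    """Non-negative KL estimate: ratio - log(ratio) - 1, with ratio = pi_ref / pi_policy."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


def penalized_objective(advantage, logp_policy, logp_ref, beta=1e-4):
    """Advantage minus the KL term; beta is a hypothetical small coefficient."""
    return advantage - beta * kl_penalty(logp_policy, logp_ref)


if __name__ == "__main__":
    # Four sampled answers to one question: two rewarded, two not.
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    print(advantages)
    print(penalized_objective(advantages[0], logp_policy=-1.0, logp_ref=-2.0))
```

Setting `beta` to zero removes the pull toward the reference model entirely, which is the configuration the abstract reports as problematic.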

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2504.00883 [cs.CV]
	(or arXiv:2504.00883v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.00883

Submission history

From: Zhenyi Liao
[v1] Tue, 1 Apr 2025 15:11:11 UTC (1,257 KB)
