R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yang, Yi; He, Xiaoxuan; Pan, Hongkun; Jiang, Xiyan; Deng, Yan; Yang, Xingtao; Lu, Haoyu; Yin, Dacheng; Rao, Fengyun; Zhu, Minfeng; Zhang, Bo; Chen, Wei

arXiv:2503.10615 (cs)

[Submitted on 13 Mar 2025 (v1), last revised 18 Mar 2025 (this version, v2)]


Abstract: Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to analyze and reason about visual content effectively, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.

Comments:	Code and Model: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.10615 [cs.CV]
	(or arXiv:2503.10615v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.10615

Submission history

From: Yang Yi
[v1] Thu, 13 Mar 2025 17:56:05 UTC (17,180 KB)
[v2] Tue, 18 Mar 2025 08:52:34 UTC (17,180 KB)
