TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

Zhang, Xingjian; Wen, Siwei; Wu, Wenjun; Huang, Lei

arXiv:2504.09641 (cs)

[Submitted on 13 Apr 2025]

Abstract: Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.09641 [cs.CV]
	(or arXiv:2504.09641v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.09641

Submission history

From: Xingjian Zhang
[v1] Sun, 13 Apr 2025 16:32:49 UTC (14,025 KB)
