VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Zhang, Boqiang; Li, Kehan; Cheng, Zesen; Hu, Zhiqiang; Yuan, Yuqian; Chen, Guanzheng; Leng, Sicong; Jiang, Yuming; Zhang, Hang; Li, Xin; Jin, Peng; Zhang, Wenqi; Wang, Fan; Bing, Lidong; Zhao, Deli

arXiv:2501.13106 (cs)

[Submitted on 22 Jan 2025]

Abstract: In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: a vision-centric training paradigm and a vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale, high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) the vision-centric alignment stage, which warms up the vision encoder and projector; 2) the vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, and charts) as well as text-only data; 3) the multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; and 4) the video-centric fine-tuning stage, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into a correspondingly varying number of vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the video representation is more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
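
The two framework ideas in the abstract (a token count that scales with image size, and dropping video tokens that are similar to one another) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give a patch size, a similarity measure, or a threshold, so `patch_size=14`, the mean absolute difference against the previous frame, `threshold=0.1`, and both function names are assumptions chosen only for illustration.

```python
import math


def num_vision_tokens(height, width, patch_size=14):
    """Token count that grows with image size (one token per patch).

    patch_size is an assumed value; the abstract does not state one.
    """
    return math.ceil(height / patch_size) * math.ceil(width / patch_size)


def prune_similar_tokens(frames, threshold=0.1):
    """Drop tokens that barely change relative to the previous frame.

    frames: list of frames, each a list of token vectors (lists of floats),
            with the same number of tokens in every frame.
    threshold: assumed cutoff on the mean absolute difference (hypothetical).
    Returns a list of (frame_index, token_index, vector) for the kept tokens.
    """
    kept = []
    for t, frame in enumerate(frames):
        for i, vec in enumerate(frame):
            if t == 0:
                # The first frame has no predecessor, so keep all its tokens.
                kept.append((t, i, vec))
                continue
            prev = frames[t - 1][i]
            diff = sum(abs(a - b) for a, b in zip(vec, prev)) / len(vec)
            if diff > threshold:
                # Keep only tokens that differ enough from the previous frame.
                kept.append((t, i, vec))
    return kept


# Toy video: two frames of two 2-dimensional tokens each.
toy_frames = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.05], [2.0, 1.0]],
]
kept_tokens = prune_similar_tokens(toy_frames)
```

In the toy input, the first token of the second frame is nearly identical to its predecessor and is dropped, while the second token changes noticeably and is kept. Under the assumed patch size, a larger image simply yields more tokens instead of being resized to a fixed grid.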

Comments:	BZ, KL, ZC, ZH, YY, GC, SL, YJ, HZ, and XL contributed equally to this project. Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.13106 [cs.CV]
	(or arXiv:2501.13106v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.13106

Submission history

From: Boqiang Zhang
[v1] Wed, 22 Jan 2025 18:59:46 UTC (9,949 KB)
