2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Zhang, Wenqi; Zhang, Hang; Li, Xin; Sun, Jiashuo; Shen, Yongliang; Lu, Weiming; Zhao, Deli; Zhuang, Yueting; Bing, Lidong

Computer Science > Computer Vision and Pattern Recognition

arXiv:2501.00958 (cs)

[Submitted on 1 Jan 2025 (v1), last revised 3 Jan 2025 (this version, v2)]

Abstract: Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, as humans do. However, existing interleaved datasets are crawled from webpages and suffer from low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts a vast number of instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos, and organize it as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at this https URL.
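
The temporal-ordering step described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `Keyframe` and `TextSegment` types, their field names, and the file names are hypothetical, and the sketch only assumes that every keyframe, ASR segment, and OCR segment carries a timestamp. The helper `class_hours_to_years` additionally assumes that the "2.5 years" figure refers to continuous viewing time (24 hours a day, 365 days a year).

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Keyframe:
    time_s: float      # timestamp of the extracted frame, in seconds
    image_path: str    # hypothetical path to the saved frame


@dataclass
class TextSegment:
    time_s: float      # start time of the segment, in seconds
    text: str
    source: str        # "asr" or "ocr"


Item = Union[Keyframe, TextSegment]


def interleave(keyframes: List[Keyframe], segments: List[TextSegment]) -> List[Item]:
    """Merge frames and text into one sequence ordered by timestamp.

    sorted() is stable and keyframes are listed first, so when a frame and
    a text segment share a timestamp the frame stays ahead of the text.
    """
    items: List[Item] = [*keyframes, *segments]
    return sorted(items, key=lambda item: item.time_s)


def class_hours_to_years(hours: float) -> float:
    """Convert continuous viewing hours to years (assumed 24 h/day, 365 days/year)."""
    return hours / 24 / 365


if __name__ == "__main__":
    frames = [Keyframe(0.0, "f0.jpg"), Keyframe(12.5, "f1.jpg")]
    texts = [
        TextSegment(1.0, "Today we cover right triangles.", "asr"),
        TextSegment(12.5, "a^2 + b^2 = c^2", "ocr"),
    ]
    for item in interleave(frames, texts):
        print(item)
    print(round(class_hours_to_years(22_000), 2))
```

Under the stated assumption, 22,000 hours comes to roughly 2.51 years, which is consistent with the "over 2.5 years" figure in the abstract.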

Comments:	Under review
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2501.00958 [cs.CV]
	(or arXiv:2501.00958v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.00958

Submission history

From: Wenqi Zhang
[v1] Wed, 1 Jan 2025 21:29:37 UTC (10,896 KB)
[v2] Fri, 3 Jan 2025 13:25:27 UTC (10,896 KB)
