PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

Wang, Junjie; Zhang, Yin; Ji, Yatai; Zhang, Yuxiang; Jiang, Chunyang; Wang, Yubo; Zhu, Kang; Wang, Zekun; Wang, Tiezhen; Huang, Wenhao; Fu, Jie; Chen, Bei; Lin, Qunshu; Liu, Minghao; Zhang, Ge; Chen, Wenhu

arXiv:2406.13923 (cs)

[Submitted on 20 Jun 2024]

Abstract: Recent advancements in Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. To address these issues, we introduce a novel dataset format, PIN (Paired and INterleaved multimodal documents), designed to significantly improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. The format combines markdown files and comprehensive images to enrich training data with a dense knowledge structure and to support versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. The dataset is constructed meticulously to ensure data quality and ethical integrity, aiming to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, which form the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance. We plan future expansions of the dataset and detailed evaluations of its impact on model capabilities.

Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2406.13923 [cs.AI]
	(or arXiv:2406.13923v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2406.13923

Submission history

From: Junjie Wang
[v1] Thu, 20 Jun 2024 01:43:08 UTC (1,463 KB)
