ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

Zhao, Zijia; Guo, Longteng; Yue, Tongtian; Chen, Sihan; Shao, Shuai; Zhu, Xinxin; Yuan, Zehuan; Liu, Jing

arXiv:2305.16103 (cs)

[Submitted on 25 May 2023]

Abstract: Building general-purpose models that can perceive diverse real-world modalities and solve various tasks is an appealing target in artificial intelligence. In this paper, we present ChatBridge, a novel multimodal language model that uses the expressive capabilities of language as a catalyst to bridge the gap between modalities. We show that language-paired two-modality data alone is sufficient to connect all modalities. ChatBridge builds on recent large language models (LLMs) and extends their zero-shot capabilities to diverse multimodal inputs. ChatBridge is trained in two stages. The first stage aligns each modality with language, which brings emergent multimodal correlation and collaboration abilities. The second stage instruction-finetunes ChatBridge to align it with user intent, using our newly proposed multimodal instruction tuning dataset, MULTIS, which covers 16 multimodal tasks across the text, image, video, and audio modalities. We show strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio. All code, data, and models of ChatBridge will be open-sourced.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as:	arXiv:2305.16103 [cs.CV]
	(or arXiv:2305.16103v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.16103

Submission history

From: Zijia Zhao
[v1] Thu, 25 May 2023 14:34:08 UTC (9,143 KB)
