E5-V: Universal Embeddings with Multimodal Large Language Models

Jiang, Ting; Song, Minghui; Zhang, Zihan; Huang, Haizhen; Deng, Weiwei; Sun, Feng; Zhang, Qi; Wang, Deqing; Zhuang, Fuzhen

arXiv:2407.12580 (cs)

[Submitted on 17 Jul 2024]

Abstract: Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains largely unexplored. In this work, we introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings. Our findings highlight the significant potential of MLLMs in representing multimodal inputs compared to previous approaches. By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single-modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%. Additionally, this approach eliminates the need for costly multimodal training data collection. Extensive experiments across four types of tasks demonstrate the effectiveness of E5-V. As a universal multimodal model, E5-V not only achieves but often surpasses state-of-the-art performance in each task, despite being trained on a single modality.
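
The abstract says E5-V is trained only on text pairs but does not name the training objective. As a rough illustration, the sketch below assumes an in-batch contrastive (InfoNCE) loss, which is a common choice for pair-based embedding training and is not confirmed by the abstract. The function names, the temperature value, and the toy vectors are hypothetical; the vectors stand in for embeddings that an MLLM would produce.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_matrix(anchors, positives):
    """Cosine similarity between every anchor and every positive embedding."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    return [[sum(x * y for x, y in zip(u, v)) for v in p] for u in a]

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss: row i should score highest in column i.

    Assumption: this objective is illustrative, not taken from the abstract.
    """
    sims = cosine_matrix(anchors, positives)
    total = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(sims)

# Toy embeddings standing in for MLLM outputs (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
```

Under this assumed objective, correctly paired embeddings give a much lower loss than mismatched ones, so minimizing it pulls each pair together in the shared embedding space. Because the loss only looks at embedding vectors, the same comparison would apply whether a vector came from text or from an image.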

Comments:	Code and models are available at this https URL
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Cite as:	arXiv:2407.12580 [cs.CL]
	(or arXiv:2407.12580v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.12580

Submission history

From: Ting Jiang
[v1] Wed, 17 Jul 2024 14:04:12 UTC (1,538 KB)
