jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

Koukounas, Andreas; Mastrapas, Georgios; Eslami, Sedigheh; Wang, Bo; Akram, Mohammad Kalim; Günther, Michael; Mohr, Isabelle; Sturua, Saba; Wang, Nan; Xiao, Han


arXiv:2412.08802 (cs)

[Submitted on 11 Dec 2024 (v1), last revised 24 Apr 2025 (this version, v2)]

Abstract: Contrastive Language-Image Pretraining (CLIP) has been widely used for crossmodal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for crossmodal vision-language tasks and underperform on text-only tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. In addition, previous CLIP-based models show limited understanding of visually rich documents. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets, and image-text pairs via a multi-task, multi-stage contrastive learning paradigm to support both text-only and crossmodal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and crossmodal retrieval tasks in both English and multilingual settings. jina-clip-v2 also offers flexibility in embedding dimensionality, enabling users to select the granularity of the representations. jina-clip-v2 is publicly available at this https URL.
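
The abstract says users can choose the embedding dimensionality but does not say how. As a minimal sketch, assuming the smaller representations are obtained by keeping a prefix of the full vector and re-normalizing it (a Matryoshka-style scheme, which is an assumption here and not confirmed by the abstract), the trade-off might look like this. The helper names and the toy 8-dimensional vectors are hypothetical and do not come from the model or its API.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and re-normalize.

    Assumption: prefix truncation is a valid way to shrink the embedding.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


# Toy vectors standing in for a text embedding and an image embedding.
text_vec = [0.9, 0.1, 0.3, 0.0, 0.2, 0.1, 0.0, 0.1]
image_vec = [0.8, 0.2, 0.4, 0.1, 0.0, 0.3, 0.1, 0.0]

full_score = cosine_similarity(text_vec, image_vec)
short_score = cosine_similarity(
    truncate_embedding(text_vec, 4),
    truncate_embedding(image_vec, 4),
)
print(f"full (8 dims): {full_score:.4f}")
print(f"truncated (4 dims): {short_score:.4f}")
```

Under this assumption, a shorter vector reduces storage and comparison cost, while the similarity score may drift from the full-dimension value; how much it drifts for the real model is something the paper's benchmarks, not this sketch, would have to show.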

Comments:	30 pages, 1-10 main paper, 10-12 refs, 12-30 benchmarks
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
MSC classes:	68T50
ACM classes:	I.2.7; I.2.10
Cite as:	arXiv:2412.08802 [cs.CL]
	(or arXiv:2412.08802v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.08802

Submission history

From: Han Xiao
[v1] Wed, 11 Dec 2024 22:28:12 UTC (755 KB)
[v2] Thu, 24 Apr 2025 16:22:33 UTC (774 KB)
