Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence

Yin, Wenzhe; Xiao, Zehao; Zhou, Pan; Yu, Shujian; Shen, Jiayi; Sonke, Jan-Jakob; Gavves, Efstratios

arXiv:2502.17028 (cs)

[Submitted on 24 Feb 2025]

Abstract: Multimodal alignment is crucial for various downstream tasks such as cross-modal generation and retrieval. Previous multimodal approaches like CLIP maximize the mutual information mainly by aligning pairwise samples across modalities while overlooking the distributional differences, leading to suboptimal alignment with modality gaps. In this paper, to overcome this limitation, we propose CS-Aligner, a novel and straightforward framework that performs distributional vision-language alignment by integrating Cauchy-Schwarz (CS) divergence with mutual information. In the proposed framework, we find that the CS divergence and mutual information serve complementary roles in multimodal alignment, capturing both the global distribution information of each modality and the pairwise semantic relationships, yielding tighter and more precise alignment. Moreover, CS-Aligner enables incorporating additional information from unpaired data and token-level representations, enhancing flexible and fine-grained alignment in practice. Experiments on text-to-image generation and cross-modality retrieval tasks demonstrate the effectiveness of our method on vision-language alignment.
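For readers unfamiliar with the quantity named in the abstract, the Cauchy-Schwarz divergence between two densities p and q is commonly defined as D_CS(p; q) = -log( (∫ p q)^2 / (∫ p^2 · ∫ q^2) ), which is non-negative and zero when p = q. The sketch below is a generic plug-in kernel estimate of that quantity from two sets of embeddings. It is not the authors' implementation: the Gaussian kernel, the bandwidth `sigma`, and the function names `gaussian_gram` and `cs_divergence` are assumptions made for illustration only. Note that the two sample sets do not need to be paired or equal in size, which is consistent with the abstract's remark about using unpaired data.

```python
import numpy as np


def gaussian_gram(a, b, sigma):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (a * a).sum(axis=1)[:, None] + (b * b).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def cs_divergence(x, y, sigma=1.0):
    """Plug-in kernel estimate of D_CS(p; q) from samples x ~ p and y ~ q.

    x has shape (n, d) and y has shape (m, d); n and m may differ.
    The bandwidth sigma is a hypothetical default, not a value from the paper.
    """
    kxx = gaussian_gram(x, x, sigma).mean()  # estimates the integral of p^2
    kyy = gaussian_gram(y, y, sigma).mean()  # estimates the integral of q^2
    kxy = gaussian_gram(x, y, sigma).mean()  # estimates the integral of p*q
    return np.log(kxx) + np.log(kyy) - 2.0 * np.log(kxy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(64, 8))        # stand-in for image embeddings
    text_emb = rng.normal(size=(48, 8)) + 1.0   # shifted stand-in for text embeddings
    print("shifted sets:", cs_divergence(image_emb, text_emb, sigma=2.0))
    print("identical sets:", cs_divergence(image_emb, image_emb, sigma=2.0))
```

In a training setup of the kind the abstract describes, a term like this would be added to a pairwise objective such as a contrastive loss; how the two are weighted is not stated in the abstract and is left open here.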

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.17028 [cs.LG]
	(or arXiv:2502.17028v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2502.17028

Submission history

From: Wenzhe Yin
[v1] Mon, 24 Feb 2025 10:29:15 UTC (15,922 KB)