It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data

Schnaus, Dominik; Araslanov, Nikita; Cremers, Daniel

Computer Science > Computer Vision and Pattern Recognition

arXiv:2503.24129 (cs)

[Submitted on 31 Mar 2025]

Abstract: The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e., without parallel data. We present the first feasibility study and investigate the conformity of existing vision and language foundation models in the context of unsupervised, or "blind", matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can indeed be matched without supervision. This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free. As a proof of concept, we showcase an unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
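
To make the quadratic assignment formulation concrete, here is a minimal sketch on synthetic data. It is not the heuristic introduced in the paper. It assumes two toy point sets whose pairwise distances agree exactly (one is a rotated, shuffled copy of the other) and searches all permutations by brute force, which is only feasible for a handful of items. The function names `pairwise_distances`, `qap_cost`, and `blind_match` are hypothetical and chosen for illustration.

```python
import itertools
import math
import random


def pairwise_distances(points):
    """Return the matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def qap_cost(dist_a, dist_b, perm):
    """Squared mismatch between dist_a[i][j] and dist_b[perm[i]][perm[j]]."""
    n = len(perm)
    return sum(
        (dist_a[i][j] - dist_b[perm[i]][perm[j]]) ** 2
        for i in range(n)
        for j in range(n)
    )


def blind_match(dist_a, dist_b):
    """Exhaustive search over all permutations (assumption: n is tiny)."""
    n = len(dist_a)
    return min(
        itertools.permutations(range(n)),
        key=lambda perm: qap_cost(dist_a, dist_b, perm),
    )


rng = random.Random(0)
n = 6

# Toy stand-in for "vision" embeddings: random 2-D points.
vision = [(rng.random(), rng.random()) for _ in range(n)]

# Toy stand-in for "language" embeddings: the same points rotated
# (rotation preserves pairwise distances) and stored in a hidden order.
angle = 0.7
c, s = math.cos(angle), math.sin(angle)
rotated = [(x * c - y * s, x * s + y * c) for x, y in vision]

hidden = list(range(n))
rng.shuffle(hidden)
language = [None] * n
for i, j in enumerate(hidden):
    language[j] = rotated[i]  # language[hidden[i]] corresponds to vision[i]

dist_v = pairwise_distances(vision)
dist_l = pairwise_distances(language)

# The matcher only sees the two distance matrices, never the pairing.
recovered = blind_match(dist_v, dist_l)
print("recovered:", recovered)
print("hidden:   ", tuple(hidden))
```

In this idealized setting the distance matrices agree exactly, so the lowest-cost permutation coincides with the hidden pairing. Real vision and language embeddings agree only approximately, and exhaustive search scales factorially with the number of items, which is why the abstract refers to a heuristic solver.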

Comments:	Accepted to CVPR 2025, Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2503.24129 [cs.CV]
	(or arXiv:2503.24129v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.24129

Submission history

From: Dominik Schnaus
[v1] Mon, 31 Mar 2025 14:14:25 UTC (1,031 KB)
