ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval

Zhan, Guanqi; Liu, Yuanpei; Han, Kai; Xie, Weidi; Zisserman, Andrew

arXiv:2502.15682 (cs)

[Submitted on 21 Feb 2025 (v1), last revised 27 Mar 2025 (this version, v2)]

Abstract: The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks, as well as providing an easy means to adapt to OOD datasets.
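The two ideas in the abstract, an MLP that maps the text query to a set of visual prompts and a second-stage re-ranking of the top candidates, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`mlp_prompts`, `rerank`), the weight shapes, and the `conditioned_score` callback are all hypothetical, and plain Python lists stand in for real encoder outputs. In an actual model the predicted prompts would condition the ViT image encoder; here that step is abstracted into the callback.

```python
import math


def mlp_prompts(text_emb, w1, w2, num_prompts, dim):
    """Hypothetical two-layer MLP: text embedding -> num_prompts vectors of size dim.

    w1 has shape (hidden, len(text_emb)); w2 has shape (num_prompts * dim, hidden).
    """
    # First layer with ReLU activation.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, text_emb))) for row in w1]
    # Second layer produces one flat vector, later split into prompt tokens.
    flat = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [flat[i * dim:(i + 1) * dim] for i in range(num_prompts)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rerank(text_emb, image_embs, conditioned_score, top_k):
    """Stage 1: rank every image by cosine similarity to the text embedding.
    Stage 2: re-order only the top_k candidates with a query-conditioned score.

    conditioned_score(text_emb, image_index) is an assumed callback standing in
    for a forward pass of the prompt-conditioned image encoder.
    """
    first = sorted(
        range(len(image_embs)),
        key=lambda i: cosine(text_emb, image_embs[i]),
        reverse=True,
    )
    head, tail = first[:top_k], first[top_k:]
    head = sorted(head, key=lambda i: conditioned_score(text_emb, i), reverse=True)
    return head + tail
```

Restricting the second stage to `top_k` candidates keeps the more expensive query-conditioned scoring off the full gallery, which is the usual motivation for a re-ranking design.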

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.15682 [cs.CV]
	(or arXiv:2502.15682v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.15682

Submission history

From: Guanqi Zhan
[v1] Fri, 21 Feb 2025 18:59:57 UTC (16,880 KB)
[v2] Thu, 27 Mar 2025 17:57:43 UTC (37,949 KB)
