Self-Enhancement Improves Text-Image Retrieval in Foundation Visual-Language Models

Yang, Yuguang; Wang, Yiming; Geng, Shupeng; Wang, Runqi; Wang, Yimi; Wu, Sheng; Zhang, Baochang

Computer Science > Computer Vision and Pattern Recognition

arXiv:2306.06691 (cs)

[Submitted on 11 Jun 2023]

Authors: Yuguang Yang, Yiming Wang, Shupeng Geng, Runqi Wang, Yimi Wang, Sheng Wu, Baochang Zhang

Abstract: The emergence of cross-modal foundation models has introduced numerous approaches grounded in text-image retrieval. However, on some domain-specific retrieval tasks, these models fail to focus on the key attributes required. To address this issue, we propose a self-enhancement framework, A^{3}R, based on the CLIP-ViT/G-14, one of the largest cross-modal models. First, we perform an Attribute Augmentation strategy to enrich the textual description for fine-grained representation before model learning. Then, we propose an Adaption Re-ranking method to unify the representation space of textual query and candidate images and re-rank candidate images relying on the adapted query after model learning. The proposed framework is validated to achieve a salient improvement over the baseline and other teams' solutions in the cross-modal image retrieval track of the 1st foundation model challenge without introducing any additional samples. The code is available at this https URL.
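
The abstract names two stages but does not spell out how either is computed, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes embeddings are plain lists of floats, that attribute augmentation simply appends attribute phrases missing from a caption, and that the adapted query is a blend of the text embedding with the mean of its top-k retrieved image embeddings (a generic pseudo-relevance-feedback scheme swapped in for illustration). The names `augment_caption`, `rank`, `adapt_and_rerank`, `top_k`, and `alpha` are all hypothetical.

```python
import math


def augment_caption(caption, attributes):
    """Hypothetical attribute augmentation: append attribute phrases the caption lacks."""
    missing = [a for a in attributes if a.lower() not in caption.lower()]
    return caption if not missing else caption + ", " + ", ".join(missing)


def normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank(query, images):
    """Return image indices sorted by descending cosine similarity to the query."""
    q = normalize(query)
    scores = [dot(q, normalize(img)) for img in images]
    return sorted(range(len(images)), key=lambda i: scores[i], reverse=True)


def adapt_and_rerank(query, images, top_k=2, alpha=0.5):
    """Assumed re-ranking scheme: pull the text query toward its top-k images, then rank again."""
    first_pass = rank(query, images)
    top = [normalize(images[i]) for i in first_pass[:top_k]]
    dim = len(query)
    # Mean of the top-k image embeddings (an assumption, not the paper's stated method).
    centroid = [sum(v[d] for v in top) / len(top) for d in range(dim)]
    q = normalize(query)
    adapted = [(1 - alpha) * q[d] + alpha * centroid[d] for d in range(dim)]
    return rank(adapted, images)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for CLIP text and image features.
    text_query = [1.0, 0.0]
    candidates = [[1.0, 0.1], [0.9, 0.5], [0.0, 1.0], [0.6, 0.6]]
    print(augment_caption("a white car", ["white", "sedan"]))
    print(adapt_and_rerank(text_query, candidates))
```

In this sketch the blend weight `alpha` controls how far the query moves from the text space toward the image space; setting it to 0 reduces the second pass to the plain cosine ranking.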

Comments:	Accepted by CVPR 2023 Workshop
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.06691 [cs.CV]
	(or arXiv:2306.06691v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.06691

Submission history

From: Yiming Wang
[v1] Sun, 11 Jun 2023 14:25:38 UTC (443 KB)
