Zero-shot Composed Text-Image Retrieval

Liu, Yikun; Yao, Jiangchao; Zhang, Ya; Wang, Yanfeng; Xie, Weidi

arXiv:2306.07272 (cs)

[Submitted on 12 Jun 2023 (v1), last revised 6 Mar 2024 (this version, v2)]

Abstract: In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expressive ability. We make the following contributions: (i) we propose a scalable pipeline to automatically construct datasets for training CIR models by exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of the core components in TransAgg; (iv) when evaluated on publicly available benchmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets and then directly conducting inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models. Project page: this https URL
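To make the composed-retrieval setting concrete, the following is a minimal sketch of a query that combines a reference-image embedding with a modification-text embedding and is then ranked against a gallery. It is not the TransAgg architecture: the fixed-weight `fuse` function, the toy 3-dimensional embeddings, and every function name here are assumptions made purely for illustration, and an adaptive model would presumably learn how to weight the modalities per query.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def fuse(image_emb, text_emb, weight):
    """Hypothetical fusion: convex combination of the two query embeddings.

    `weight` is fixed here; an adaptive model would predict it per query.
    """
    mixed = [weight * i + (1.0 - weight) * t for i, t in zip(image_emb, text_emb)]
    return normalize(mixed)


def rank(query, gallery):
    """Return gallery indices sorted by descending cosine similarity."""
    scores = [sum(q * g for q, g in zip(query, normalize(item))) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def recall_at_k(ranked, target, k):
    """1.0 if the target index appears in the top-k results, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


# Toy 3-d embeddings (made-up numbers, for illustration only).
reference_image = normalize([1.0, 0.0, 0.0])
modification_text = normalize([0.0, 1.0, 0.0])
gallery = [
    [1.0, 0.0, 0.0],  # the reference image itself
    [1.0, 1.0, 0.0],  # target: reference content plus the requested change
    [0.0, 0.0, 1.0],  # unrelated image
]

query = fuse(reference_image, modification_text, weight=0.5)
ranked = rank(query, gallery)
print(ranked, recall_at_k(ranked, target=1, k=1))
```

In this toy setup the fused query ranks the gallery item that contains both the reference content and the requested change first, which is the behaviour that Recall@K-style metrics reward on composed retrieval benchmarks.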

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.07272 [cs.CV]
	(or arXiv:2306.07272v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.07272

Submission history

From: Yikun Liu
[v1] Mon, 12 Jun 2023 17:56:01 UTC (37,453 KB)
[v2] Wed, 6 Mar 2024 07:16:06 UTC (37,459 KB)
