Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval

Tang, Yuanmin; Yu, Jing; Gai, Keke; Zhuang, Jiamin; Xiong, Gang; Gou, Gaopeng; Wu, Qi

arXiv:2503.17109 (cs)

[Submitted on 21 Mar 2025 (v1), last revised 30 Mar 2025 (this version, v2)]

Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) covers diverse tasks with a broad range of visual content manipulation intents across domain, scene, object, and attribute. The key challenge in ZS-CIR is to modify a reference image according to the manipulation text so as to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, that adaptively predicts the missing target visual content of reference images in the latent space before mapping, for accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that includes the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information in the latent space, guided by the user intention expressed in the manipulation text. Together, the two modules map an image with the predicted relevant information to a pseudo-word token without extra supervision. Our model shows strong generalization ability on six ZS-CIR tasks. It obtains consistent and significant performance boosts ranging from 1.73% to 4.45% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at this https URL.
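The two-step idea in the abstract (build a source view by omitting content from a target view, then predict the omitted content in latent space conditioned on an action) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the PrediCIR implementation: the random patch masking, the mean-plus-action stand-in predictor, the squared-error loss, and all function names are hypothetical.

```python
import random

def make_source_view(target_patches, mask_ratio, rng):
    """Hide a random subset of the target view's patch embeddings.

    Returns the visible (index, embedding) pairs and the hidden indices.
    Random masking is an assumption; the abstract only says content is omitted.
    """
    n = len(target_patches)
    n_masked = int(n * mask_ratio)
    masked_idx = sorted(rng.sample(range(n), n_masked))
    masked_set = set(masked_idx)
    visible = [(i, p) for i, p in enumerate(target_patches) if i not in masked_set]
    return visible, masked_idx

def predict_missing(visible, masked_idx, action, dim):
    """Toy stand-in for a learned predictor (hypothetical).

    Guesses every hidden embedding as the mean visible embedding
    shifted by the action embedding.
    """
    mean = [sum(p[d] for _, p in visible) / len(visible) for d in range(dim)]
    guess = [m + a for m, a in zip(mean, action)]
    return {i: list(guess) for i in masked_idx}

def latent_l2_loss(predicted, target_patches):
    """Mean squared error between predicted and true hidden embeddings.

    The choice of loss is an assumption, not taken from the abstract.
    """
    total = 0.0
    for i, vec in predicted.items():
        total += sum((v - t) ** 2 for v, t in zip(vec, target_patches[i]))
    return total / max(len(predicted), 1)

# Usage with synthetic embeddings: 16 patches of dimension 4.
rng = random.Random(0)
dim = 4
target = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(16)]
action = [0.0] * dim  # placeholder for an embedded manipulation intent
visible, masked_idx = make_source_view(target, 0.5, rng)
pred = predict_missing(visible, masked_idx, action, dim)
loss = latent_l2_loss(pred, target)
```

In a trained system the stand-in predictor would be replaced by a learned network and the loss would drive its parameters; the sketch only shows how the source view, the action, and the prediction target relate to one another.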

Comments:	This work has been accepted to CVPR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.17109 [cs.CV]
	(or arXiv:2503.17109v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.17109

Submission history

From: Yuanmin Tang
[v1] Fri, 21 Mar 2025 12:49:50 UTC (10,212 KB)
[v2] Sun, 30 Mar 2025 12:19:03 UTC (10,212 KB)
