Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training

Liu, Chong; Zhang, Yuqi; Wang, Hongsong; Chen, Weihua; Wang, Fan; Huang, Yan; Shen, Yi-Dong; Wang, Liang

doi:10.1109/TIP.2023.3286710

Abstract: Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various vision-and-language tasks. Most previous works either learn only coarse-grained representations of the whole image and text, or elaborately establish correspondences between image regions or pixels and text words. However, the close relation between coarse- and fine-grained representations within each modality, although important for image-text retrieval, has been almost entirely neglected. As a result, such works suffer from either low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning in a unified framework. This framework is consistent with human cognition, as humans attend simultaneously to the entire sample and to its regional elements when understanding semantic content. To this end, we propose a Token-Guided Dual Transformer (TGDT) architecture for image-text retrieval, which consists of two homogeneous branches, one for the image modality and one for the text modality. TGDT incorporates both coarse- and fine-grained retrieval into a unified framework and leverages the advantages of both approaches. A novel training objective, the Consistent Multimodal Contrastive (CMC) loss, is proposed accordingly to ensure intra- and inter-modal semantic consistency between images and texts in the common embedding space. Equipped with a two-stage inference method based on mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performance with extremely low inference time compared with representative recent approaches.
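The two-stage inference mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the similarity formulas, so the function names, the shortlist size `k`, the mixing weight `alpha`, and the best-match averaging used for the local score are all assumptions made for illustration. The idea shown is only the general pattern: rank the whole gallery with a cheap global similarity, then re-rank a short list with a score that mixes global and token-level similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_similarity(query_tokens, cand_tokens):
    """Assumed fine-grained score (hypothetical, not from the paper):
    match each query token to its most similar candidate token and
    average the best-match scores."""
    best = [max(cosine(q, c) for c in cand_tokens) for q in query_tokens]
    return sum(best) / len(best)


def two_stage_retrieve(query, gallery, k=2, alpha=0.5):
    """Return gallery indices ranked by a mixed global/local score.

    Each item is a dict with a 'global' vector and a list of 'tokens'
    vectors. `k` and `alpha` are illustrative parameters.
    """
    # Stage 1: cheap global similarity over the whole gallery, keep top-k.
    coarse = [(cosine(query["global"], g["global"]), i)
              for i, g in enumerate(gallery)]
    shortlist = sorted(coarse, reverse=True)[:k]

    # Stage 2: re-rank only the shortlist with the mixed score.
    mixed = [
        (alpha * s
         + (1 - alpha) * local_similarity(query["tokens"], gallery[i]["tokens"]),
         i)
        for s, i in shortlist
    ]
    return [i for _, i in sorted(mixed, reverse=True)]


# Toy data: item 0 wins on global similarity alone, but item 1 covers
# both query tokens and overtakes it after re-ranking.
query = {"global": [1.0, 0.0], "tokens": [[1.0, 0.0], [0.0, 1.0]]}
gallery = [
    {"global": [1.0, 0.0], "tokens": [[1.0, 0.0]]},
    {"global": [0.9, 0.1], "tokens": [[1.0, 0.0], [0.0, 1.0]]},
    {"global": [0.0, 1.0], "tokens": [[0.0, 1.0]]},
]

print(two_stage_retrieve(query, gallery, k=2))  # [1, 0]
```

Because the token-level comparison runs only on the `k` shortlisted candidates, its cost is independent of the gallery size, which is the usual motivation for a coarse-then-fine design of this kind.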

Comments:	Code is publicly available: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.08789 [cs.CV]
	(or arXiv:2306.08789v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.08789
Related DOI:	https://doi.org/10.1109/TIP.2023.3286710

