SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs

Li, Haoxuan; Bin, Yi; Ma, Yunshan; Wang, Guoqing; Yang, Yang; Ng, See-Kiong; Chua, Tat-Seng

Abstract: Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and images via embedding-based similarity calculations, recent advances in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and uses a generative model to directly predict the identifiers corresponding to input queries, without explicit indexing. Despite its great potential, current generative CMR approaches still suffer from insufficient semantic information in both identifier construction and identifier generation. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash semantic understanding capabilities in the generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy that enables fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to consider both text-to-image and image-to-text retrieval within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.

Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as: arXiv:2504.13172 [cs.IR] (or arXiv:2504.13172v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2504.13172

