Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval

Caffagni, Davide; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

arXiv:2503.01980 (cs)

[Submitted on 3 Mar 2025]

Abstract: Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both the visual and textual backbones, on both the query and document sides. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on the M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at this https URL.
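
The abstract describes a recurrent cell that walks over backbone layers, merges textual and visual features at each layer, and uses LSTM-style sigmoidal gates. The sketch below is only a rough illustration of that general idea and is not the authors' implementation. Everything in it is an assumption made for the example: the function names (`attend`, `gated_step`, `fuse_layers`), the single state vector, the attention without learned projections, and the exact form of the forget and input gates. It is written in dependency-free Python so it can be run as-is.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(weights, vec):
    # Multiply a (d x d) matrix, stored as a list of rows, by a length-d vector.
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def attend(query, feats):
    # Scaled dot-product attention of one query vector over a list of feature
    # vectors. Hypothetical simplification: no learned Q/K/V projections.
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d) for feat in feats]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # shift by max for stability
    total = sum(weights)
    return [sum(w * feat[j] for w, feat in zip(weights, feats)) / total
            for j in range(d)]

def gated_step(state, text_feats, image_feats, w_forget, w_text, w_image):
    # One recurrence step (assumed form): the state attends to each modality,
    # then sigmoid gates decide how much old state and new context to keep.
    text_ctx = attend(state, text_feats)
    image_ctx = attend(state, image_feats)
    forget = [sigmoid(x) for x in matvec(w_forget, state)]
    gate_text = [sigmoid(x) for x in matvec(w_text, text_ctx)]
    gate_image = [sigmoid(x) for x in matvec(w_image, image_ctx)]
    return [f * s + gt * t + gi * v
            for f, s, gt, t, gi, v
            in zip(forget, state, gate_text, text_ctx, gate_image, image_ctx)]

def fuse_layers(init_state, text_layers, image_layers, w_forget, w_text, w_image):
    # Apply the cell once per backbone layer, carrying the state forward.
    state = init_state
    for text_feats, image_feats in zip(text_layers, image_layers):
        state = gated_step(state, text_feats, image_feats,
                           w_forget, w_text, w_image)
    return state
```

Here `text_layers` and `image_layers` are lists with one entry per layer, each entry being a list of token or patch feature vectors of the same dimension as the state. The final state plays the role of a single fused embedding in this toy version; sharing one set of gate weights across layers is what makes the cell recurrent rather than a stack of independent fusion blocks.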

Comments:	CVPR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as:	arXiv:2503.01980 [cs.CV]
	(or arXiv:2503.01980v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.01980

Submission history

From: Sara Sarto
[v1] Mon, 3 Mar 2025 19:01:17 UTC (4,688 KB)
