RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training

Yuan, Zheng; Jin, Qiao; Tan, Chuanqi; Zhao, Zhengyun; Yuan, Hongyi; Huang, Fei; Huang, Songfang

arXiv:2303.00534 (cs)

[Submitted on 1 Mar 2023]

Abstract: Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared to general-domain VQA, biomedical VQA performance suffers from limited data. In this paper, we propose RAMM, a retrieval-augmented pretrain-and-finetune paradigm for biomedical VQA that addresses this data limitation. Specifically, we collect a new biomedical dataset named PMCPM, which offers patient-based image-text pairs from PubMed covering diverse patient situations. Then, we pretrain a biomedical multi-modal model to learn visual and textual representations for image-text pairs and align these representations with an image-text contrastive (ITC) objective. Finally, we propose a retrieval-augmented method to make better use of the limited data: we retrieve similar image-text pairs from the pretraining datasets based on ITC and introduce a novel retrieval-attention module that fuses the representation of the image and the question with the retrieved images and texts. Experiments demonstrate that our retrieval-augmented pretrain-and-finetune paradigm obtains state-of-the-art performance on the Med-VQA2019, Med-VQA2021, VQARAD, and SLAKE datasets. Further analysis shows that the proposed RAMM and PMCPM enhance biomedical VQA performance compared with previous resources and methods. We will open-source our dataset, code, and pretrained model.
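
The two ITC-related steps named in the abstract (aligning image and text representations contrastively, then retrieving similar pairs by embedding similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic symmetric InfoNCE-style contrastive loss and plain cosine-similarity top-k retrieval, and the function names (`itc_loss`, `retrieve_top_k`), the temperature value, and the toy embeddings are all hypothetical. The retrieval-attention fusion module is not sketched because the abstract does not specify it.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    Assumption: image_embs[i] and text_embs[i] belong to the same pair;
    the temperature is an illustrative default, not a value from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled similarity between image i and text j
    logits = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
              for i in range(n)]

    def cross_entropy(rows):
        # The target for row i is column i (the matched pair).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_image))

def retrieve_top_k(query_emb, corpus_embs, k=2):
    """Return indices of the k corpus embeddings most similar to the query."""
    q = normalize(query_emb)
    scores = [dot(q, normalize(c)) for c in corpus_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs (hypothetical data).
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned loss:", itc_loss(images, texts))
    print("misaligned loss:", itc_loss(images, texts[::-1]))

    corpus = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print("retrieved:", retrieve_top_k([1.0, 0.05], corpus, k=2))
```

In this sketch, a well-aligned batch yields a loss close to zero, while swapping the texts produces a much larger loss; the retrieval step simply ranks a stored corpus by cosine similarity in the same embedding space.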

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2303.00534 [cs.CV]
	(or arXiv:2303.00534v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.00534

Submission history

From: Zheng Yuan
[v1] Wed, 1 Mar 2023 14:21:19 UTC (1,843 KB)
