On the Compositional Generalization of Multimodal LLMs for Medical Imaging

Cai, Zhenyang; Chen, Junying; Wang, Rongsheng; Wang, Weihong; Deng, Yonglin; Song, Dingjie; Chen, Yize; Zhang, Zixu; Wang, Benyou

arXiv:2412.20070 (cs)

[Submitted on 28 Dec 2024]

Abstract: Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need to understand what kinds of images MLLMs can use for generalization. Current research suggests that multi-task training outperforms single-task training because different tasks can benefit each other, but it often overlooks the internal relationships within these tasks, providing limited guidance on selecting datasets to enhance specific tasks. To analyze this phenomenon, we employ compositional generalization (CG), the ability of models to understand novel combinations by recombining learned elements, as a guiding framework. Medical images can be precisely defined by Modality, Anatomical area, and Task, which naturally provides an environment for exploring CG. We therefore assembled 106 medical datasets to create Med-MAT for comprehensive experiments. The experiments confirmed that MLLMs can use CG to understand unseen medical images and identified CG as one of the main drivers of the generalization observed in multi-task training. Further studies demonstrated that CG effectively supports datasets with limited data and delivers consistent performance across different backbones, highlighting its versatility and broad applicability. Med-MAT is publicly available at this https URL.
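
As a rough illustration of the CG setup described in the abstract, the sketch below checks whether a target (Modality, Anatomical area, Task) combination is unseen as a whole while each of its three elements appears somewhere in the training data. The `MAT` type, the `is_cg_split` helper, and the example labels are hypothetical and chosen for illustration; they are not taken from Med-MAT or from the authors' code.

```python
from typing import NamedTuple


class MAT(NamedTuple):
    """Hypothetical label for a dataset: Modality, Anatomical area, Task."""
    modality: str
    area: str
    task: str


def is_cg_split(train: list[MAT], target: MAT) -> bool:
    """Return True if `target` is a novel combination of seen elements.

    The target triple itself must be absent from `train`, but its modality,
    anatomical area, and task must each occur in at least one training triple.
    """
    if target in train:
        return False  # the combination has been seen, so nothing is composed
    return (
        any(t.modality == target.modality for t in train)
        and any(t.area == target.area for t in train)
        and any(t.task == target.task for t in train)
    )


# Illustrative labels only (assumed, not drawn from Med-MAT).
train = [
    MAT("X-ray", "lung", "abnormality detection"),
    MAT("CT", "brain", "cancer diagnosis"),
]
target = MAT("CT", "lung", "cancer diagnosis")

print(is_cg_split(train, target))
```

Under these assumed labels, the target combines the modality and task of the second training set with the anatomical area of the first, so it counts as a CG candidate. A target with a modality that never appears in training (for example MRI) would not.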

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2412.20070 [cs.CV]
	(or arXiv:2412.20070v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.20070

Submission history

From: Zhenyang Cai
[v1] Sat, 28 Dec 2024 07:50:00 UTC (8,091 KB)
