Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module

Liu, Yishen; Liu, Shengda; Pan, Hudan


arXiv:2503.18297 (cs)

[Submitted on 24 Mar 2025 (v1), last revised 27 Mar 2025 (this version, v2)]

Abstract: Medical report generation requires specialized expertise that general-purpose large models often fail to capture accurately. Moreover, the inherent repetition and similarity in medical data make it difficult for models to extract meaningful features, which leads to a tendency to overfit. In this paper, we propose the Co-Attention Triple-LSTM Network (CA-TriNet), a multimodal deep learning model that combines transformer architectures with a Multi-LSTM network. Its Co-Attention module links a vision transformer with a text transformer to better differentiate similar medical images, and is augmented by an adaptive weight operator that captures and amplifies image labels with minor similarities. Furthermore, its Triple-LSTM module refines generated sentences using targeted image objects. Extensive evaluations on three public datasets demonstrate that CA-TriNet outperforms state-of-the-art models in overall performance and even surpasses pre-trained large language models on some metrics.
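To make the co-attention idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes image patch features and text token features are plain lists of equal-dimension vectors, uses standard scaled dot-product attention in both directions, and stands in for the adaptive weight operator with a single scalar `gate`. The names `attend`, `co_attention`, and `gate` are hypothetical; the abstract does not specify how the operator is defined.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys_values):
    """Scaled dot-product attention: each query attends over keys_values,
    which serve as both keys and values (no learned projections here)."""
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    contexts = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        contexts.append(
            [sum(w * kv[i] for w, kv in zip(weights, keys_values)) for i in range(dim)]
        )
    return contexts


def mix(features, contexts, gate):
    """Blend each original vector with its cross-modal context.
    ASSUMPTION: a scalar gate in [0, 1] replaces the paper's adaptive weight operator."""
    return [
        [(1.0 - gate) * x + gate * c for x, c in zip(f, ctx)]
        for f, ctx in zip(features, contexts)
    ]


def co_attention(image_feats, text_feats, gate=0.5):
    """Hypothetical co-attention: image attends over text and text over image."""
    img_ctx = attend(image_feats, text_feats)
    txt_ctx = attend(text_feats, image_feats)
    return mix(image_feats, img_ctx, gate), mix(text_feats, txt_ctx, gate)


if __name__ == "__main__":
    image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy image patches
    text_feats = [[1.0, 0.0], [0.0, 1.0]]               # two toy text tokens
    fused_img, fused_txt = co_attention(image_feats, text_feats, gate=0.5)
    print(len(fused_img), len(fused_txt))
```

In a trained model the gate would presumably be learned and could vary per label or per feature; a fixed scalar is used here only to keep the sketch self-contained.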

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.18297 [cs.CV]
	(or arXiv:2503.18297v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.18297

Submission history

From: Yishen Liu
[v1] Mon, 24 Mar 2025 03:02:11 UTC (1,832 KB)
[v2] Thu, 27 Mar 2025 06:47:06 UTC (1,832 KB)
