Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Sarto, Sara; Cornia, Marcella; Cucchiara, Rita

arXiv:2503.14604 (cs)

[Submitted on 18 Mar 2025]

Abstract: The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.
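
As an illustration of the first dimension named in the abstract, correlation with human judgment, the following minimal sketch computes Kendall's tau-a between a metric's scores and human ratings for the same captions. It is not taken from the paper: the function name `kendall_tau_a`, the choice of the tau-a variant, and the score values are all assumptions made for illustration, and the survey itself may use a different correlation coefficient or tie handling.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two captions")

    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both rankings order the pair the same way.
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores for four captions (illustrative values only).
metric = [0.71, 0.42, 0.88, 0.15]
human = [4, 2, 5, 1]
tau = kendall_tau_a(metric, human)  # 1.0: the two rankings agree on every pair
```

A value near 1 means the metric orders captions the way human raters do, a value near -1 means it inverts their ordering, and a value near 0 means the two rankings are unrelated.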

Comments:	Repo GitHub: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2503.14604 [cs.CV]
	(or arXiv:2503.14604v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.14604

Submission history

From: Sara Sarto
[v1] Tue, 18 Mar 2025 18:03:56 UTC (262 KB)
