Image Embedding Sampling Method for Diverse Captioning

Waheed, Sania; An, Na Min

arXiv:2502.10118 (cs)

[Submitted on 14 Feb 2025]

Authors: Sania Waheed, Na Min An

Abstract: Image captioning with state-of-the-art VLMs has improved significantly over time; however, this improvement comes at the cost of increased computational complexity, making these models less accessible for resource-constrained applications such as mobile devices and assistive technologies. Smaller VLMs, by contrast, prioritize high-level scene descriptions and overlook finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions, using a comparatively small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on the MSCOCO, Flickr30k, and NoCaps test datasets, achieving Div-2 scores of 0.735, 0.750, and 0.748, respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions.
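
The abstract reports Div-2 scores without defining the metric. As a rough illustration only, the sketch below assumes a common distinct-n formulation: the number of unique n-grams divided by the total number of n-grams across the captions generated for one image. The function name `div_n`, the lowercasing, and the whitespace tokenization are assumptions made for this example and are not taken from the paper; some variants normalize by total word count instead, so the authors' exact computation may differ.

```python
from collections.abc import Sequence


def div_n(captions: Sequence[str], n: int = 2) -> float:
    """Distinct n-grams divided by total n-grams over a set of captions.

    Hypothetical helper: tokenization and normalization are assumptions,
    not the paper's stated procedure.
    """
    ngrams = []
    for caption in captions:
        tokens = caption.lower().split()  # assumed: simple whitespace tokens
        # Collect every contiguous window of n tokens in this caption.
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0  # no n-grams available (empty input or captions too short)
    return len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    # Two captions share the bigram ("a", "dog"), so 3 of 4 bigrams are unique.
    print(div_n(["a dog runs", "a dog sits"]))
```

Under this formulation, a score near 1 means the captions for an image rarely repeat the same word pairs, while a score near 0 means they are largely redundant.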

Comments:	15 pages, 5 figures, 6 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.10118 [cs.CV]
	(or arXiv:2502.10118v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.10118

Submission history

From: Na Min An
[v1] Fri, 14 Feb 2025 12:33:19 UTC (5,385 KB)