RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models

Moshtaghi, Mehdi; Khajavi, Siavash H.; Pajarinen, Joni

arXiv:2503.19654 (cs)

[Submitted on 25 Mar 2025 (v1), last revised 30 Mar 2025 (this version, v3)]

Abstract: We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to comprehend RGB-Thermal image pairs. While VLMs have demonstrated remarkable progress in visual reasoning and multimodal understanding, their evaluation has been predominantly limited to RGB-based benchmarks, leaving a critical gap in assessing their capabilities in infrared vision tasks. Existing visible-infrared datasets are either task-specific or lack the high-quality annotations necessary for rigorous model evaluation. To address these limitations, RGB-Th-Bench provides a comprehensive evaluation framework covering 14 distinct skill dimensions, with a total of 1,600+ expert-annotated Yes/No questions. The benchmark employs two accuracy metrics: a standard question-level accuracy and a stricter skill-level accuracy, which evaluates model robustness across multiple questions within each skill dimension. This design ensures a thorough assessment of model performance, including resilience to adversarial and hallucinated responses. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-Thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities. Additionally, the lack of large-scale, application-specific, expert-annotated thermal-caption-pair datasets in pre-training is an important reason for the observed performance gap. RGB-Th-Bench highlights the urgent need for further advancements in multimodal learning to bridge the gap between visible and thermal image understanding. The dataset is available through this link, and the evaluation code will also be made publicly available.
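The two metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of skill-level accuracy, so the rule used here is an assumption: a group of questions sharing the same image pair and skill dimension counts as correct only if every question in it is answered correctly. The record fields (`sample`, `skill`, `gold`, `pred`) and the skill names are hypothetical and are not taken from the benchmark's released code.

```python
from collections import defaultdict


def question_level_accuracy(records):
    """Fraction of individual Yes/No questions answered correctly."""
    if not records:
        return 0.0
    correct = sum(1 for r in records if r["pred"] == r["gold"])
    return correct / len(records)


def skill_level_accuracy(records):
    """Fraction of (sample, skill) groups in which every question is correct.

    ASSUMPTION: the all-or-nothing grouping rule is one plausible reading of
    "stricter skill-level accuracy"; the paper's definition may differ.
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["sample"], r["skill"])].append(r["pred"] == r["gold"])
    if not groups:
        return 0.0
    return sum(all(flags) for flags in groups.values()) / len(groups)


# Hypothetical toy data: one RGB-Thermal pair, two skills, two questions each.
records = [
    {"sample": "pair_1", "skill": "counting", "gold": "yes", "pred": "yes"},
    {"sample": "pair_1", "skill": "counting", "gold": "no", "pred": "yes"},
    {"sample": "pair_1", "skill": "scene", "gold": "yes", "pred": "yes"},
    {"sample": "pair_1", "skill": "scene", "gold": "no", "pred": "no"},
]

print(question_level_accuracy(records))  # 0.75
print(skill_level_accuracy(records))     # 0.5
```

Under this assumed rule, a model that always answers "yes" gets half of a balanced question set right at the question level but fails every group that contains a "no" question, which is one way a grouped metric can penalize hallucinated or biased responses.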

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2503.19654 [cs.CV]
	(or arXiv:2503.19654v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.19654

Submission history

From: Mehdi Moshtaghi
[v1] Tue, 25 Mar 2025 13:43:47 UTC (5,836 KB)
[v2] Thu, 27 Mar 2025 10:11:22 UTC (5,836 KB)
[v3] Sun, 30 Mar 2025 15:08:23 UTC (5,836 KB)
