How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark

Wen, Ximing; Mainali, Mallika; Sen, Anik

Computer Science > Computer Vision and Pattern Recognition

arXiv:2503.22093 (cs)

[Submitted on 28 Mar 2025]

Title:How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark

Authors:Ximing Wen, Mallika Mainali, Anik Sen

View PDF HTML (experimental)

Abstract:Vision Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks; However, their ability to perform Theory of Mind (ToM) tasks such as accurately inferring human intentions, beliefs, and other mental states remains underexplored. In this work, we propose an open-ended question framework to comprehensively evaluate VLMs' performance across diverse categories of ToM tasks. We curated and annotated a benchmark dataset composed of 30 images. We then assessed the performance of four VLMs of varying sizes on this dataset. Our experimental results show that the GPT-4 model outperformed all others, with only one smaller model, GPT-4o-mini, achieving comparable performance. Additionally, we observed that VLMs often struggle to accurately infer intentions in complex scenarios such as bullying or cheating. Moreover, our findings also reveal that smaller models can sometimes infer correct intentions despite relying on incorrect visual cues.

Comments:	2 pages, accepted by ToM@AAAI25
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.22093 [cs.CV]
	(or arXiv:2503.22093v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.22093

Submission history

From: Ximing Wen [view email]
[v1] Fri, 28 Mar 2025 02:26:32 UTC (8,849 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators