Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

Lee, Phillip Y.; Je, Jihyeon; Park, Chanho; Uy, Mikaela Angelina; Guibas, Leonidas; Sung, Minhyuk

arXiv:2504.17207 (cs)

[Submitted on 24 Apr 2025]

Abstract: We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding and is essential for interacting with the environment and collaborating with autonomous agents. Despite advances in spatial reasoning within VLMs, recent research has shown that modern VLMs largely lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge this gap between VLMs and human perception, we focus on the role of mental imagery, in which humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework named Abstract Perspective Change (APC) that leverages vision foundation models for tasks such as object detection, segmentation, and orientation estimation to construct scene abstractions and enable perspective transformations. In experiments on synthetic and real-image benchmarks, our framework significantly improves perspective-aware reasoning over various VLMs and also outperforms fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
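To make the idea of a perspective transformation over a scene abstraction concrete, the following is a minimal sketch and not the authors' implementation. It assumes each object has already been reduced to a 3D position and a facing direction in the camera frame (in APC these would come from the vision foundation models; here they are hard-coded). The names `AbstractObject`, `to_reference_frame`, and `side_of`, and the axis convention (x right, y up, z forward), are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class AbstractObject:
    """Hypothetical scene-abstraction entry: a name, a position, and a facing direction."""
    name: str
    position: Vec3  # camera frame: x right, y up, z forward (assumed convention)
    forward: Vec3   # direction the object faces, in the same camera frame


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _normalize(a: Vec3) -> Vec3:
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def to_reference_frame(point: Vec3, viewer: AbstractObject,
                       up: Vec3 = (0.0, 1.0, 0.0)) -> Vec3:
    """Express `point` in the viewer's own frame (x right, y up, z forward).

    Assumes the viewer's forward direction is not parallel to `up`.
    """
    f = _normalize(viewer.forward)
    r = _normalize(_cross(up, f))  # viewer's right
    u = _cross(f, r)               # viewer's up
    d = _sub(point, viewer.position)
    return (_dot(d, r), _dot(d, u), _dot(d, f))


def side_of(target: AbstractObject, viewer: AbstractObject) -> str:
    """Answer 'is the target on the viewer's left or right?'."""
    x, _, _ = to_reference_frame(target.position, viewer)
    return "left" if x < 0 else "right"


# The camera itself, treated as a viewer at the origin looking down +z.
camera = AbstractObject("camera", (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
# A person 4 units away, facing back toward the camera.
person = AbstractObject("person", (0.0, 0.0, 4.0), (0.0, 0.0, -1.0))
# A cup slightly to the camera's right.
cup = AbstractObject("cup", (1.0, 0.0, 3.0), (0.0, 0.0, 1.0))

print(side_of(cup, camera))  # egocentric answer
print(side_of(cup, person))  # answer from the person's perspective
```

In this toy scene the cup is on the camera's right but on the person's left, which is the kind of egocentric-versus-allocentric discrepancy the abstract describes.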

Comments:	Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.17207 [cs.CV]
	(or arXiv:2504.17207v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.17207

Submission history

From: Phillip Y. Lee
[v1] Thu, 24 Apr 2025 02:41:34 UTC (1,803 KB)
