Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

Chen, Shiqi; Zhu, Tongyao; Zhou, Ruochen; Zhang, Jinghan; Gao, Siyang; Niebles, Juan Carlos; Geva, Mor; He, Junxian; Wu, Jiajun; Li, Manling

arXiv:2503.01773 (cs)

[Submitted on 3 Mar 2025 (v1), last revised 4 Mar 2025 (this version, v2)]

Abstract: Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge through the lens of mechanistic interpretability, diving into the model's internal states to examine the interactions between image and text tokens. By tracing the attention distribution over the image throughout the intermediate layers, we observe that successful spatial reasoning correlates strongly with the model's ability to align its attention distribution with actual object locations, and that this behavior differs notably between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose ADAPTVIS, which uses inference-time confidence scores to sharpen the attention on highly relevant regions when the model is confident, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to 50 absolute points) on spatial reasoning benchmarks such as WhatsUp and VSR at negligible cost. We make code and data publicly available for research purposes at this https URL.
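
The confidence-dependent sharpening and smoothing described in the abstract can be illustrated with a minimal sketch. This is not the authors' released implementation. It assumes the adjustment can be modelled as multiplying the attention logits over image tokens by a coefficient greater than 1 when confidence is high and less than 1 when it is low. The function name `adaptive_image_attention`, the `threshold`, `sharpen`, and `smooth` parameters, and their default values are all hypothetical, and the softmax here is taken over image tokens only as a simplification.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means a more peaked distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_image_attention(image_logits, confidence, threshold=0.5,
                             sharpen=2.0, smooth=0.5):
    """Rescale image-token attention logits based on a confidence score.

    Hypothetical parameters (not taken from the paper):
      confidence: assumed to lie in [0, 1], e.g. an output-token probability
      threshold:  confidence level separating the two regimes
      sharpen:    coefficient > 1 applied when confident (peakier attention)
      smooth:     coefficient < 1 applied otherwise (flatter attention)
    """
    scale = sharpen if confidence >= threshold else smooth
    return softmax([scale * x for x in image_logits])

# Toy attention logits over four image patches.
logits = [2.0, 1.0, 0.5, 0.1]
base = softmax(logits)
confident = adaptive_image_attention(logits, confidence=0.9)
unsure = adaptive_image_attention(logits, confidence=0.2)

print("entropy (confident):", entropy(confident))
print("entropy (baseline): ", entropy(base))
print("entropy (unsure):   ", entropy(unsure))
```

In this toy setting the confident case concentrates more attention mass on the highest-scoring patch than the baseline does, while the low-confidence case spreads it more evenly, which mirrors the qualitative behaviour the abstract describes.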

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2503.01773 [cs.CL]
	(or arXiv:2503.01773v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.01773

Submission history

From: Shiqi Chen
[v1] Mon, 3 Mar 2025 17:57:03 UTC (22,080 KB)
[v2] Tue, 4 Mar 2025 18:01:19 UTC (22,080 KB)
