Do Multimodal Language Models Really Understand Direction? A Benchmark for Compass Direction Reasoning

Yin, Hang; Lin, Zhifeng; Liu, Xin; Sun, Bin; Li, Kan

arXiv:2412.16599 (cs)

[Submitted on 21 Dec 2024]

Abstract: Direction reasoning is essential for intelligent systems to understand the real world. While existing work focuses primarily on spatial reasoning, compass direction reasoning remains underexplored. To address this, we propose the Compass Direction Reasoning (CDR) benchmark, designed to evaluate the direction reasoning capabilities of multimodal language models (MLMs). CDR includes three types of images to test spatial (up, down, left, right) and compass (north, south, east, west) directions. Our evaluation reveals that most MLMs struggle with direction reasoning, often performing at the level of random guessing. Experiments show that training directly on CDR data yields limited improvements, because the task requires an understanding of real-world physical rules. We explore the impact of mixdata and CoT fine-tuning methods, which significantly enhance MLM performance in compass direction reasoning by incorporating diverse data and step-by-step reasoning, improving the model's ability to understand direction relationships.
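
To make the contrast between spatial and compass directions concrete, here is a minimal Python sketch. It is not the CDR benchmark's data format or scoring code, neither of which is given on this page. The names `compass_for`, `accuracy`, and `CHANCE_LEVEL` are hypothetical, and the four-option question format behind the 25% chance level is an assumption made for illustration.

```python
# Hypothetical sketch: NOT the CDR benchmark's actual task format or scorer.
# Assumption: each question has four answer options, so chance is 1/4.

COMPASS = ["north", "east", "south", "west"]  # clockwise order
SPATIAL = ["up", "right", "down", "left"]     # clockwise order on the image


def compass_for(spatial: str, north_points: str = "up") -> str:
    """Map a spatial direction on the image to a compass direction,
    given the spatial direction in which north points on that image."""
    offset = SPATIAL.index(spatial) - SPATIAL.index(north_points)
    return COMPASS[offset % 4]


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Random-guessing baseline under the four-option assumption.
CHANCE_LEVEL = 1 / len(COMPASS)
```

On a conventional map with north at the top, `compass_for("right")` gives `"east"`. If the compass rose is rotated so that north points right, `compass_for("up", north_points="right")` gives `"west"`. Under the four-option assumption, a model whose `accuracy` stays near `CHANCE_LEVEL` is doing no better than guessing.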

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.16599 [cs.AI]
	(or arXiv:2412.16599v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2412.16599

Submission history

From: Zhifeng Lin
[v1] Sat, 21 Dec 2024 12:09:13 UTC (355 KB)
