Evaluating Vision-Language Models as Evaluators in Path Planning

Aghzal, Mohamed; Yue, Xiang; Plaku, Erion; Yao, Ziyu

arXiv:2411.18711 (cs)

[Submitted on 27 Nov 2024 (v1), last revised 26 Mar 2025 (this version, v3)]

Abstract: Despite their promise to perform complex reasoning, large language models (LLMs) have been shown to have limited effectiveness in end-to-end planning. This has inspired an intriguing question: if these models cannot plan well, can they still contribute to the planning framework as a helpful plan evaluator? In this work, we generalize this question to consider LLMs augmented with visual understanding, i.e., Vision-Language Models (VLMs). We introduce PathEval, a novel benchmark evaluating VLMs as plan evaluators in complex path-planning scenarios. Succeeding in the benchmark requires a VLM to be able to abstract traits of optimal paths from the scenario description, demonstrate precise low-level perception on each path, and integrate this information to decide the better path. Our analysis of state-of-the-art VLMs reveals that these models face significant challenges on the benchmark. We observe that the VLMs can precisely abstract given scenarios to identify the desired traits and exhibit mixed performance in integrating the provided information. Yet, their vision component presents a critical bottleneck, with models struggling to perceive low-level details about a path. Our experimental results show that this issue cannot be trivially addressed via end-to-end fine-tuning; rather, task-specific discriminative adaptation of these vision encoders is needed for these VLMs to become effective path evaluators.
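To make the three steps in the abstract concrete (abstract the desired trait, perceive low-level details of each path, integrate them into a decision), the following is a minimal illustrative sketch and not the PathEval implementation. Everything in it is an assumption made for illustration: the criterion names "shortest" and "safest", circular obstacles, the two descriptors (path length and minimum clearance), and all function names.

```python
import math

Point = tuple[float, float]
Circle = tuple[float, float, float]  # hypothetical obstacle: (center_x, center_y, radius)


def path_length(path: list[Point]) -> float:
    """Sum of Euclidean lengths of consecutive segments."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from point p to the closed segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.dist(p, a)
    # Project p onto the segment and clamp the parameter to [0, 1].
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))


def min_clearance(path: list[Point], obstacles: list[Circle]) -> float:
    """Smallest gap between any path segment and any obstacle boundary."""
    return min(
        point_segment_distance((cx, cy), a, b) - r
        for a, b in zip(path, path[1:])
        for cx, cy, r in obstacles
    )


def better_path(path_a: list[Point], path_b: list[Point],
                obstacles: list[Circle], criterion: str) -> str:
    """Return "A" or "B" according to the (assumed) scenario criterion."""
    if criterion == "shortest":      # trait: shorter is better
        score_a, score_b = -path_length(path_a), -path_length(path_b)
    elif criterion == "safest":      # trait: larger clearance is better
        score_a = min_clearance(path_a, obstacles)
        score_b = min_clearance(path_b, obstacles)
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return "A" if score_a >= score_b else "B"


# Hypothetical scene: one obstacle, a direct path and a detour.
obstacles = [(2.0, 1.0, 0.5)]
path_a = [(0.0, 0.0), (4.0, 0.0)]
path_b = [(0.0, 0.0), (2.0, -2.0), (4.0, 0.0)]
choice_shortest = better_path(path_a, path_b, obstacles, "shortest")
choice_safest = better_path(path_a, path_b, obstacles, "safest")
```

In this sketch the descriptors are computed exactly from coordinates. The abstract's finding is that a VLM must instead recover such details from an image of the path, and that this perception step is where the models struggle.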

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2411.18711 [cs.CV]
	(or arXiv:2411.18711v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.18711

Submission history

From: Mohamed Aghzal
[v1] Wed, 27 Nov 2024 19:32:03 UTC (4,639 KB)
[v2] Tue, 4 Mar 2025 03:01:25 UTC (4,699 KB)
[v3] Wed, 26 Mar 2025 20:18:38 UTC (4,699 KB)
