Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

Zhao, Bowen; Dirac, Leo Parker; Varshavskaya, Paulina

arXiv:2409.17080 (cs)

[Submitted on 25 Sep 2024]

Abstract: Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But can VLMs learn novel concepts purely from visual demonstrations, or are they limited to adapting to the output format of ICL examples? We propose a new benchmark we call Spatial Visual Ambiguity Tasks (SVAT) that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. We find that VLMs fail to do this zero-shot, and sometimes continue to fail after finetuning. However, adding simpler data to the training by curriculum learning leads to improved ICL performance.

Comments:	13 pages, 4 figures. Code released at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2409.17080 [cs.CV]
	(or arXiv:2409.17080v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2409.17080

Submission history

From: Bowen Zhao
[v1] Wed, 25 Sep 2024 16:45:02 UTC (7,271 KB)
