Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning

Zhu, Yingjie; Bai, Xuefeng; Chen, Kehai; Xiang, Yang; Zhang, Min

arXiv:2412.13540 (cs)

[Submitted on 18 Dec 2024]

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs reveal that LVLMs are weak in basic graph understanding and reasoning tasks, particularly those concerning relational or structurally complex information. Based on this observation, we propose a structure-aware fine-tuning framework to enhance LVLMs with structure learning abilities through 3 self-supervised learning tasks. Experiments validate the effectiveness of our method in improving LVLMs' zero-shot performance on fundamental graph learning tasks, as well as enhancing the robustness of LVLMs against complex visual graphs.

Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.13540 [cs.CL]
	(or arXiv:2412.13540v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.13540

Submission history

From: Yingjie Zhu
[v1] Wed, 18 Dec 2024 06:35:18 UTC (704 KB)
