SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

Cheng, Xianfu; Zhang, Wei; Zhang, Shiwei; Yang, Jian; Guan, Xiangyuan; Wu, Xianjie; Li, Xiang; Zhang, Ge; Liu, Jiaheng; Mai, Yuying; Zeng, Yutao; Wen, Zhoufutu; Jin, Ke; Wang, Baorui; Zhou, Weixiao; Lu, Yunhong; Li, Tongliang; Huang, Wenhao; Li, Zhoujun

arXiv:2502.13059 (cs)

[Submitted on 18 Feb 2025]

Abstract: The increasing application of multi-modal large language models (MLLMs) across various sectors has highlighted the importance of the reliability and accuracy of their outputs, particularly their ability to produce content grounded in factual information (e.g., common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark for evaluating the factuality of MLLMs when answering short natural-language questions. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high-quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach categorizes visual question-answering items into 9 different tasks built around objective events or common knowledge and situates them within 9 topics. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, which allows evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, examining their image comprehension and text generation abilities by identifying and analyzing error cases.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.13059 [cs.CL]
	(or arXiv:2502.13059v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.13059

Submission history

From: Jian Yang
[v1] Tue, 18 Feb 2025 17:04:26 UTC (19,427 KB)
