MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Ma, Yubo; Zang, Yuhang; Chen, Liangyu; Chen, Meiqi; Jiao, Yizhu; Li, Xinze; Lu, Xinyuan; Liu, Ziyu; Ma, Yan; Dong, Xiaoyi; Zhang, Pan; Pan, Liangming; Jiang, Yu-Gang; Wang, Jiaqi; Cao, Yixin; Sun, Aixin

arXiv:2407.01523 (cs)

[Submitted on 1 Jul 2024 (v1), last revised 10 Jul 2024 (this version, v2)]

Abstract: Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. For comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e., page numbers). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages, and 22.8% of the questions are designed to be unanswerable in order to detect potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) perform even worse than their LLM counterparts, which are fed lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: this https URL
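As a rough aid to reading the percentages in the abstract, the sketch below converts them into approximate absolute counts. The figures are the ones quoted in the abstract; the variable names and the `approx_count` helper are illustrative assumptions, and the rounded results are estimates rather than counts reported by the authors.

```python
# Headline figures quoted in the abstract.
NUM_QUESTIONS = 1062        # expert-annotated questions
NUM_DOCUMENTS = 130         # PDF-formatted documents
AVG_PAGES_PER_DOC = 49.4    # average pages per document
CROSS_PAGE_SHARE = 0.332    # share of questions needing multi-page evidence
UNANSWERABLE_SHARE = 0.228  # share of questions designed to be unanswerable


def approx_count(total: int, share: float) -> int:
    """Hypothetical helper: turn a reported share into a rounded count."""
    return round(total * share)


# Approximate counts implied by the reported percentages.
cross_page = approx_count(NUM_QUESTIONS, CROSS_PAGE_SHARE)
unanswerable = approx_count(NUM_QUESTIONS, UNANSWERABLE_SHARE)

# Approximate total number of pages across the whole benchmark.
total_pages = round(NUM_DOCUMENTS * AVG_PAGES_PER_DOC)

print(f"cross-page questions:   ~{cross_page}")
print(f"unanswerable questions: ~{unanswerable}")
print(f"total pages:            ~{total_pages}")
```

Under these assumptions, roughly a third of the questions (about 350) require evidence from more than one page, and somewhat under a quarter (about 240) have no answer in the document.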

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2407.01523 [cs.CV]
	(or arXiv:2407.01523v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.01523

Submission history

From: Yuhang Zang
[v1] Mon, 1 Jul 2024 17:59:26 UTC (24,157 KB)
[v2] Wed, 10 Jul 2024 15:31:09 UTC (24,112 KB)
