OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Fu, Ling; Yang, Biao; Kuang, Zhebin; Song, Jiajun; Li, Yuzhe; Zhu, Linghao; Luo, Qidi; Wang, Xinyu; Lu, Hao; Huang, Mingxin; Li, Zhang; Tang, Guozhi; Shan, Bin; Lin, Chunhui; Liu, Qi; Wu, Binghong; Feng, Hao; Liu, Hao; Huang, Can; Tang, Jingqun; Chen, Wei; Jin, Lianwen; Liu, Yuliang; Bai, Xiang

Computer Science > Computer Vision and Pattern Recognition

arXiv:2501.00321 (cs)

[Submitted on 31 Dec 2024]

Abstract: Interest in scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has grown recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with what is currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios, including street scenes, receipts, formulas, and diagrams), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 out of 100 and suffer from five types of limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.00321 [cs.CV]
	(or arXiv:2501.00321v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.00321

Submission history

From: Ling Fu
[v1] Tue, 31 Dec 2024 07:32:35 UTC (7,078 KB)
