On the Hidden Mystery of OCR in Large Multimodal Models

Liu, Yuliang; Li, Zhang; Li, Hongliang; Yu, Wenwen; Huang, Mingxin; Peng, Dezhi; Liu, Mingyu; Chen, Mingrui; Li, Chunyuan; Jin, Lianwen; Bai, Xiang

arXiv:2305.07895v2 (cs)

[Submitted on 13 May 2023 (v1), revised 31 May 2023 (this version, v2), latest version 26 Aug 2024 (v7)]

Abstract: Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. Their efficacy in text-related visual tasks, however, remains less explored. We conducted a comprehensive study of existing publicly available multimodal models, evaluating their performance in text recognition (document text, artistic text, handwritten text, scene text), text-based visual question answering (document text, scene text, and bilingual text), key information extraction (receipts, documents, and nutrition facts), and handwritten mathematical expression recognition. Our findings reveal strengths and weaknesses in these models, which primarily rely on semantic understanding for word recognition and exhibit inferior perception of individual character shapes. They also display indifference towards text length and have limited capabilities in detecting fine-grained features in images. Consequently, these results demonstrate that even the most powerful current large multimodal models cannot match domain-specific methods in traditional text tasks and face greater challenges in more complex tasks. Most importantly, the baseline results showcased in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline will be available at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2305.07895 [cs.CV]
	(or arXiv:2305.07895v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.07895

Submission history

From: Yuliang Liu
[v1] Sat, 13 May 2023 11:28:37 UTC (22,220 KB)
[v2] Wed, 31 May 2023 08:36:44 UTC (24,524 KB)
[v3] Thu, 8 Jun 2023 15:14:16 UTC (6,183 KB)
[v4] Mon, 19 Jun 2023 03:36:08 UTC (4,989 KB)
[v5] Wed, 17 Jan 2024 12:02:33 UTC (2,225 KB)
[v6] Wed, 14 Aug 2024 03:30:14 UTC (2,236 KB)
[v7] Mon, 26 Aug 2024 02:37:14 UTC (2,236 KB)
