Computer Science > Computer Vision and Pattern Recognition
[Submitted on 13 May 2023 (v1), revised 31 May 2023 (this version, v2), latest version 26 Aug 2024 (v7)]
Title: On the Hidden Mystery of OCR in Large Multimodal Models
Abstract: Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. Their efficacy in text-related visual tasks, however, remains less explored. We conducted a comprehensive study of existing publicly available multimodal models, evaluating their performance in text recognition (document text, artistic text, handwritten text, scene text), text-based visual question answering (document text, scene text, and bilingual text), key information extraction (receipts, documents, and nutrition facts), and handwritten mathematical expression recognition. Our findings reveal strengths and weaknesses in these models, which primarily rely on semantic understanding for word recognition and exhibit inferior perception of individual character shapes. They also display indifference towards text length and have limited capabilities in detecting fine-grained features in images. These results demonstrate that even the most powerful current large multimodal models cannot match domain-specific methods in traditional text tasks and face greater challenges in more complex tasks. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline will be available at this https URL.
Submission history
From: Yuliang Liu
[v1] Sat, 13 May 2023 11:28:37 UTC (22,220 KB)
[v2] Wed, 31 May 2023 08:36:44 UTC (24,524 KB)
[v3] Thu, 8 Jun 2023 15:14:16 UTC (6,183 KB)
[v4] Mon, 19 Jun 2023 03:36:08 UTC (4,989 KB)
[v5] Wed, 17 Jan 2024 12:02:33 UTC (2,225 KB)
[v6] Wed, 14 Aug 2024 03:30:14 UTC (2,236 KB)
[v7] Mon, 26 Aug 2024 02:37:14 UTC (2,236 KB)