CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset

Abdallah, Abdelrahman; Abdalla, Mahmoud; Kasem, Mahmoud SalahEldin; Mahmoud, Mohamed; Abdelhalim, Ibrahim; Elkasaby, Mohamed; ElBendary, Yasser; Jatowt, Adam

Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.04493 (cs)

[Submitted on 6 Jun 2024]

Title: CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset

Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Ibrahim Abdelhalim, Mohamed Elkasaby, Yasser ElBendary, Adam Jatowt

Abstract: In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing. Our datasets are publicly accessible (this https URL).

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2406.04493 [cs.CV]
	(or arXiv:2406.04493v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.04493

Submission history

From: Abdelrahman E.M. Abdallah
[v1] Thu, 6 Jun 2024 20:38:15 UTC (14,130 KB)
