Weakly supervised information extraction from inscrutable handwritten document images

Paul, Sujoy; Madan, Gagan; Mishra, Akankshya; Hegde, Narayan; Kumar, Pradeep; Aggarwal, Gaurav

arXiv:2306.06823 (cs)

[Submitted on 12 Jun 2023]

Abstract: State-of-the-art information extraction methods are limited by OCR errors. They work well for printed text in form-like documents, but unstructured, handwritten documents remain a challenge. Adapting existing models with domain-specific training data is expensive for two reasons: 1) domain-specific documents (such as handwritten prescriptions and lab notes) are scarce, and 2) annotation is harder still, because decoding inscrutable handwritten document images requires domain-specific knowledge. In this work, we focus on the complex problem of extracting medicine names from handwritten prescriptions using only weakly labeled data. The data consists of images, each paired with the list of medicine names it contains but not their locations in the image. We solve the problem by first identifying the regions of interest, i.e., medicine lines, from just the weak labels, and then injecting a domain-specific medicine language model learned using only synthetically generated data. Compared to off-the-shelf state-of-the-art methods, our approach performs more than 2.5x better at extracting medicine names from prescriptions.
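To make the weak-label setting concrete, the following is a minimal illustrative sketch and not the authors' method, which the abstract does not detail. It assumes a hypothetical upstream OCR step that yields noisy text lines, and it swaps in plain fuzzy string matching (Python's standard `difflib`) to guess which lines mention a medicine from the image-level list of names. The function names, the sample strings, and the 0.7 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_token_score(line: str, name: str) -> float:
    """Best similarity between a medicine name and any single token of the line."""
    return max((similarity(tok, name) for tok in line.split()), default=0.0)


def pseudo_label_lines(ocr_lines, medicine_names, threshold=0.7):
    """Pair each OCR line with the closest weak label, or None if nothing is close.

    ocr_lines: noisy text lines from a hypothetical OCR step (assumption).
    medicine_names: image-level weak labels, with no location information.
    threshold: illustrative cut-off, not a value taken from the paper.
    """
    labels = []
    for line in ocr_lines:
        score, name = max(
            ((best_token_score(line, n), n) for n in medicine_names),
            default=(0.0, None),
        )
        labels.append((line, name if score >= threshold else None))
    return labels


if __name__ == "__main__":
    # Made-up example strings; "Paracetmol" imitates an OCR misread.
    lines = ["Tab Paracetmol 500mg", "Review after 5 days"]
    names = ["Paracetamol", "Amoxicillin"]
    for line, name in pseudo_label_lines(lines, names):
        print(f"{line!r} -> {name}")
```

In this toy setup the first line is matched to "Paracetamol" despite the misspelling, while the second line gets no label. A heuristic like this breaks down when the OCR output is badly garbled, which is the regime the abstract targets with learned medicine-line detection and a domain-specific language model.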

Comments:	Accepted at ICDAR 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2306.06823 [cs.CV]
	(or arXiv:2306.06823v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.06823

Submission history

From: Sujoy Paul
[v1] Mon, 12 Jun 2023 02:22:30 UTC (1,751 KB)
