Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles

Naiman, Jill P.; Cosillo, Morgan G.; Williams, Peter K. G.; Goodman, Alyssa

arXiv:2309.11549 (cs)

[Submitted on 20 Sep 2023]

Abstract: Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for generating a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the arXiv we create, to the authors' knowledge, the largest scientific synthetic ground truth/OCR post correction dataset, comprising 203,354,393 character pairs. We provide baseline models trained with this dataset and find mean improvements in character and word error rates of 7.71% and 18.82%, respectively, for historical OCR text. When used to classify parts of sentences as inline math, we find a classification F1 score of 77.82%. Interactive dashboards to explore the dataset are available online: this https URL, and data and code, within the limitations of our agreement with the arXiv, are hosted on GitHub: this https URL.
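
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how character error rate (CER) and word error rate (WER) are commonly computed from a ground truth string and an OCR string, using Levenshtein edit distance. It is an illustration only and is not the authors' evaluation code: the function names `edit_distance`, `cer`, and `wer`, and the sample strings, are assumptions made for this example, and the abstract does not state how the reported improvement is normalized.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


# Hypothetical sample strings, not taken from the dataset.
truth = "the mean flux density"
ocr = "tbe mean f1ux density"    # raw OCR output with two character errors
fixed = "the mean f1ux density"  # output of a post correction step

print("CER before/after:", cer(truth, ocr), cer(truth, fixed))
print("WER before/after:", wer(truth, ocr), wer(truth, fixed))
```

In this toy case the post correction step repairs one of the two damaged words, so both error rates drop. Because a single wrong character spoils a whole word, word error rate is typically the larger of the two numbers for the same text.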

Comments:	6 pages, 1 figure, 1 table; training/validation/test datasets and all model weights to be linked on Zenodo on publication
Subjects:	Digital Libraries (cs.DL); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Cite as:	arXiv:2309.11549 [cs.DL]
	(or arXiv:2309.11549v1 [cs.DL] for this version)
	https://doi.org/10.48550/arXiv.2309.11549

Submission history

From: Jill Naiman
[v1] Wed, 20 Sep 2023 18:00:02 UTC (390 KB)
