olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

Poznanski, Jake; Borchardt, Jon; Dunkelberger, Jason; Huff, Regan; Lin, Daniel; Rangapur, Aman; Wilhelm, Christopher; Lo, Kyle; Soldaini, Luca

arXiv:2502.18443 (cs)

[Submitted on 25 Feb 2025]

Abstract: PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. We present olmOCR, an open-source Python toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties, including graphics, handwritten text and poor quality scans. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and convert a million PDF pages for only $190 USD. We release all components of olmOCR including VLM weights, data and training code, as well as inference code built on serving frameworks including vLLM and SGLang.
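
To put the quoted cost figure in perspective, the following is a minimal back-of-the-envelope sketch in Python. It assumes cost scales linearly with page count at the abstract's rate of $190 USD per million pages; that linearity, and the helper names `estimate_cost_usd` and `pages_per_dollar`, are illustrative assumptions and not part of the olmOCR toolkit.

```python
# Rate quoted in the abstract: $190 USD to convert one million PDF pages.
USD_PER_MILLION_PAGES = 190.0


def estimate_cost_usd(num_pages: int) -> float:
    """Estimate conversion cost, ASSUMING cost is linear in page count.

    Hypothetical helper for illustration only; real cost will depend on
    hardware, page complexity, and batching.
    """
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * USD_PER_MILLION_PAGES / 1_000_000


def pages_per_dollar() -> float:
    """Number of pages converted per US dollar at the quoted rate."""
    return 1_000_000 / USD_PER_MILLION_PAGES


if __name__ == "__main__":
    for pages in (10_000, 1_000_000, 100_000_000):
        print(f"{pages:>11,} pages -> ${estimate_cost_usd(pages):,.2f}")
    print(f"about {pages_per_dollar():,.0f} pages per dollar")
```

Under this linear assumption, the quoted rate works out to roughly 5,000 pages per dollar, which is the scale that makes batch conversion of very large PDF crawls plausible.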

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.18443 [cs.CL]
	(or arXiv:2502.18443v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.18443

Submission history

From: Luca Soldaini
[v1] Tue, 25 Feb 2025 18:38:38 UTC (3,128 KB)
