Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs

Colakoglu, Gaye; Solmaz, Gürkan; Fürst, Jonathan

arXiv:2502.18179 (cs)

[Submitted on 25 Feb 2025]

Abstract: This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study examines the sub-problems within these core challenges, such as input representation, chunking, prompting, and the selection of LLMs and multimodal models. It evaluates the outcomes of different design choices through a new layout-aware IE test suite, benchmarking against the state-of-the-art (SoA) model LayoutLMv3. The results show that the configuration from a one-factor-at-a-time (OFAT) trial achieves near-optimal results, with a 14.1-point F1-score gain over the baseline model, while full factorial exploration yields only a slightly higher gain of 15.1 points at around 36x greater token usage. We demonstrate that well-configured general-purpose LLMs can match the performance of specialized models, providing a cost-effective alternative. Our test suite is freely available at this https URL.
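The contrast between OFAT and full factorial exploration can be illustrated with a minimal sketch. It is not the authors' test suite: the factor names follow the sub-problems listed in the abstract, but the levels, the `toy_score` function, and the greedy OFAT variant (keep the best level of each factor before moving to the next) are all assumptions made for illustration. The paper's actual procedure and search space may differ.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]

# Hypothetical design space: factor names come from the abstract,
# the levels are invented for this sketch.
DESIGN_SPACE: Dict[str, List[str]] = {
    "input_representation": ["plain_text", "markdown", "text_with_bboxes"],
    "chunking": ["none", "small", "large"],
    "prompting": ["zero_shot", "few_shot", "chain_of_thought"],
    "model": ["model_a", "model_b"],
}


def full_factorial(space: Dict[str, List[str]]) -> List[Config]:
    """Enumerate every combination of factor levels."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in product(*space.values())]


def ofat(
    space: Dict[str, List[str]], evaluate: Callable[[Config], float]
) -> Tuple[Config, float, int]:
    """Greedy one-factor-at-a-time search starting from each factor's first level.

    Returns the best configuration found, its score, and the number of
    configurations that were evaluated.
    """
    best = {name: levels[0] for name, levels in space.items()}
    best_score = evaluate(best)
    evaluations = 1
    for name, levels in space.items():
        for level in levels:
            if level == best[name]:
                continue  # this configuration has already been scored
            candidate = {**best, name: level}
            score = evaluate(candidate)
            evaluations += 1
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score, evaluations


def toy_score(cfg: Config) -> float:
    """Stand-in for an F1 evaluation; purely additive and made up."""
    gains = {
        "markdown": 3.0, "text_with_bboxes": 2.0,
        "small": 1.0, "large": 1.5,
        "few_shot": 4.0, "chain_of_thought": 2.5,
        "model_b": 5.0,
    }
    return sum(gains.get(level, 0.0) for level in cfg.values())


if __name__ == "__main__":
    best, score, n_ofat = ofat(DESIGN_SPACE, toy_score)
    n_full = len(full_factorial(DESIGN_SPACE))
    print(f"OFAT evaluated {n_ofat} configurations, full factorial {n_full}")
    print(f"OFAT best: {best} (toy score {score})")
```

With these made-up levels, OFAT scores 8 configurations while the full factorial grid contains 54. Because `toy_score` is additive, OFAT also lands on the grid optimum; when factors interact, it can miss it, which is consistent with the small gap the abstract reports between the two strategies. The counts here follow from the invented levels and do not reproduce the paper's 36x token-usage figure.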

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.18179 [cs.CL]
	(or arXiv:2502.18179v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.18179

Submission history

From: Jonathan Fürst
[v1] Tue, 25 Feb 2025 13:11:53 UTC (1,007 KB)
