Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset

Yue, Chongjian; Xu, Xinrun; Ma, Xiaojun; Du, Lun; Ding, Zhiming; Han, Shi; Zhang, Dongmei; Zhang, Qi

arXiv:2412.20072 (cs)

[Submitted on 28 Dec 2024 (v1), last revised 31 Dec 2024 (this version, v2)]

Abstract: Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, which contains both textual and tabular data, remains unexplored. Hybrid text often appears in the form of hybrid long documents (HLDs), which far exceed the token limit of LLMs. Consequently, we apply an Automated Information Extraction (AIE) framework to enable LLMs to process HLDs, and we carry out experiments to analyze four important aspects of information extraction from HLDs. Our findings are: 1) an effective way to select and summarize the useful part of an HLD; 2) a simple table serialization method is enough for LLMs to understand tables; 3) the naive AIE adapts to many complex scenarios; 4) useful prompt engineering techniques that enhance LLMs on HLDs. To address the scarcity of HLD datasets and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset. The dataset and code are publicly available in the attachments.
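To make the first two findings concrete, here is a minimal Python sketch of what a simple table serialization and a segment selection step might look like. It is not the AIE implementation from the paper: the function names `serialize_table` and `select_segments`, the pipe separator, the keyword-count scoring, and the toy table values are all assumptions chosen for illustration.

```python
def serialize_table(header, rows, sep=" | "):
    """Flatten a table into plain text, one line per row (assumed format)."""
    lines = [sep.join(str(cell) for cell in header)]
    for row in rows:
        lines.append(sep.join(str(cell) for cell in row))
    return "\n".join(lines)


def select_segments(segments, keywords, top_k=2):
    """Keep the top_k segments with the most keyword hits, in document order.

    Keyword counting is a stand-in for whatever retrieval method a real
    pipeline would use; it is an assumption, not the paper's method.
    """
    scored = []
    for index, text in enumerate(segments):
        lowered = text.lower()
        score = sum(lowered.count(keyword.lower()) for keyword in keywords)
        scored.append((score, index))
    # Highest score first; ties broken by earlier position in the document.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_k]
    # Restore document order before returning the chosen segments.
    return [segments[index] for _, index in sorted(best, key=lambda pair: pair[1])]


if __name__ == "__main__":
    # Toy values, invented for the example only.
    table_text = serialize_table(
        ["Item", "2022", "2023"],
        [["Revenue", 100, 120], ["Net income", 10, 15]],
    )
    document = ["Letter to shareholders.", table_text, "Risk factors and legal notes."]
    for segment in select_segments(document, ["revenue", "net income"], top_k=1):
        print(segment)
```

In a pipeline of this shape, the selected segments would then be summarized or passed directly to the LLM so that the prompt stays within the model's token limit.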

Comments:	ICASSP 2025
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.20072 [cs.CL]
	(or arXiv:2412.20072v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.20072

Submission history

From: Xinrun Xu
[v1] Sat, 28 Dec 2024 07:54:14 UTC (1,088 KB)
[v2] Tue, 31 Dec 2024 03:11:03 UTC (1,088 KB)
