Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval

Molina, Adrià; Terrades, Oriol Ramos; Lladós, Josep

arXiv:2406.07315v1 (cs)

[Submitted on 11 Jun 2024 (this version), latest version 16 Jun 2024 (v2)]

Abstract: This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark tailored for legislative historical document analysis systems, addressing the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a vast repository of documents dating back to the 17th century, serving both as a training resource and an evaluation benchmark for retrieval systems. It fills a critical gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. The proposed benchmark tackles the multifaceted problem of historical document analysis, including text-to-image retrieval for queries and image-to-text topic extraction from document fragments, all while accommodating varying levels of document legibility. This benchmark aims to spur advancements in the field by providing baselines and data for the development and evaluation of robust historical document retrieval systems, particularly in scenarios characterized by a wide historical spectrum.
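The abstract does not state how retrieval quality is measured. As an illustration only, the minimal Python sketch below shows one common way to score a text-to-image retrieval task of the kind described: rank document embeddings by cosine similarity to each query embedding and report recall@k. The metric choice, the function names (`cosine`, `rank_documents`, `recall_at_k`), and the toy two-dimensional vectors are all assumptions made for this example; they are not taken from the paper or from the FAS data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(query_vecs, doc_vecs, relevant, k):
    """Fraction of queries whose single relevant document appears in the top k."""
    hits = 0
    for query_vec, target in zip(query_vecs, relevant):
        if target in rank_documents(query_vec, doc_vecs)[:k]:
            hits += 1
    return hits / len(query_vecs)

# Hypothetical embeddings: in a real system these would come from a text
# encoder (queries) and an image encoder (document fragments).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vecs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
relevant = [0, 1, 2]  # index of the correct document for each query

print(recall_at_k(query_vecs, doc_vecs, relevant, k=1))  # 2 of 3 queries hit at rank 1
print(recall_at_k(query_vecs, doc_vecs, relevant, k=2))  # all 3 queries hit within the top 2
```

In this toy setup the third query ranks document 0 above its intended document 2, so it counts as a miss at k=1 and a hit at k=2. A benchmark with degraded or hard-to-read documents would be expected to show the same kind of gap between small and large k.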

Comments:	Preprint for the manuscript accepted for publication in the DAS2024 LNCS proceedings
Subjects:	Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.07315 [cs.IR]
	(or arXiv:2406.07315v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2406.07315

Submission history

From: Adrià Molina Rodríguez
[v1] Tue, 11 Jun 2024 14:45:00 UTC (44,253 KB)
[v2] Sun, 16 Jun 2024 16:59:29 UTC (12,802 KB)
