Pralekha: An Indic Document Alignment Evaluation Benchmark

Suryanarayanan, Sanjay; Song, Haiyue; Khan, Mohammed Safi Ur Rahman; Kunchukuttan, Anoop; Khapra, Mitesh M.; Dabre, Raj

arXiv:2411.19096 (cs)

[Submitted on 28 Nov 2024]

Abstract: Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of concrete evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: the embedding models, the granularity levels, and the alignment algorithm. To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates substantial improvements over baseline pooling approaches, particularly in noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC's effectiveness in parallel document mining for Indic languages.
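The abstract contrasts pooling baselines with a score built from sentence or chunk-level alignments, but it does not give the formula for DAC. The sketch below is therefore only an illustration of the general idea, not the authors' definition: `pooled_score` mean-pools chunk embeddings and compares documents by cosine similarity, while `alignment_ratio` is a hypothetical Dice-style ratio (twice the number of matched chunks over the total chunk count). The function names, the greedy one-to-one matching, the 0.8 threshold, and the toy 2-dimensional embeddings are all assumptions made for this example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def pooled_score(src_chunks, tgt_chunks):
    """Pooling baseline: one averaged vector per document, compared by cosine."""
    return cosine(mean_pool(src_chunks), mean_pool(tgt_chunks))


def alignment_ratio(src_chunks, tgt_chunks, threshold=0.8):
    """Hypothetical Dice-style score: 2 * matched / (len(src) + len(tgt)).

    Matching is greedy and one-to-one: each source chunk takes its most
    similar unused target chunk, and the pair counts only if the similarity
    reaches `threshold` (an assumed value, not taken from the paper).
    """
    used = set()
    matched = 0
    for s in src_chunks:
        best_j, best_sim = None, -1.0
        for j, t in enumerate(tgt_chunks):
            if j in used:
                continue
            sim = cosine(s, t)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim >= threshold:
            used.add(best_j)
            matched += 1
    return 2 * matched / (len(src_chunks) + len(tgt_chunks))


# Toy embeddings: the second target repeats one chunk and adds an unrelated one.
source = [[1.0, 0.0], [0.0, 1.0]]
clean_target = [[1.0, 0.0], [0.0, 1.0]]
noisy_target = [[1.0, 0.0], [1.0, 0.0], [-1.0, 1.0]]

if __name__ == "__main__":
    for name, target in [("clean", clean_target), ("noisy", noisy_target)]:
        print(name, pooled_score(source, target), alignment_ratio(source, target))
```

On this toy input the noisy target happens to average to the same direction as the source, so the pooled score stays close to 1 even though only one chunk genuinely matches, whereas the ratio drops to 0.4. This is one plausible reading of why an alignment-count score could be more robust than pooling in noisy settings; the paper itself should be consulted for the actual definition of DAC.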

Comments:	Work in Progress
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2411.19096 [cs.CL]
	(or arXiv:2411.19096v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2411.19096

Submission history

From: Sanjay Suryanarayanan
[v1] Thu, 28 Nov 2024 12:17:24 UTC (1,102 KB)
