Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages

Litschko, Robert; Kraus, Oliver; Blaschke, Verena; Plank, Barbara

arXiv:2412.12806 (cs)

[Submitted on 17 Dec 2024]

Abstract: A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While cross-lingual information retrieval (CLIR) has been researched extensively, cross-dialect retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources for training retrieval models and the high variability of non-standardized languages. We study these challenges using German dialects as an example and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which consists of seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with the high lexical variation in dialects. We further show that the commonly used zero-shot cross-lingual transfer approach with multilingual encoders does not transfer well to extremely low-resource setups, motivating the need for resource-lean and dialect-specific retrieval models. Finally, we demonstrate that (document) translation is an effective way to reduce the dialect gap in CDIR.

Comments:	Accepted at COLING 2025
Subjects:	Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:	arXiv:2412.12806 [cs.CL]
	(or arXiv:2412.12806v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.12806

Submission history

From: Robert Litschko
[v1] Tue, 17 Dec 2024 11:21:09 UTC (8,601 KB)
