BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Kuratov, Yuri; Bulatov, Aydar; Anokhin, Petr; Rodkin, Ivan; Sorokin, Dmitry; Sorokin, Artyom; Burtsev, Mikhail

arXiv:2406.10149 (cs)

[Submitted on 14 Jun 2024]

Abstract: In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.
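The "reasoning-in-a-haystack" setup described in the abstract, in which task facts are scattered across long natural text, can be illustrated with a minimal sketch. This is not the authors' generation code: the function name `build_haystack`, the sentence-level granularity, and the filler sentences are all assumptions made for illustration. The only property it is meant to show is that the facts keep their original order while being separated by unrelated text, and that the total length is a free parameter.

```python
import random


def build_haystack(facts, filler_sentences, total_sentences, seed=0):
    """Scatter `facts` among filler sentences, preserving fact order.

    Hypothetical helper for illustration only; not the benchmark's code.
    Returns the assembled text and the sentence positions of the facts.
    """
    if total_sentences < len(facts):
        raise ValueError("total_sentences must be at least len(facts)")
    rng = random.Random(seed)
    # Pick distinct slots for the facts; sorting keeps them in task order.
    positions = sorted(rng.sample(range(total_sentences), len(facts)))
    slots = set(positions)
    remaining = iter(facts)
    sentences = []
    for i in range(total_sentences):
        if i in slots:
            sentences.append(next(remaining))  # a fact needed for the answer
        else:
            sentences.append(rng.choice(filler_sentences))  # distractor text
    return " ".join(sentences), positions


# Example: two supporting facts hidden in 50 sentences of filler.
facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
filler = ["The weather was mild that year.", "A long road led to the town."]
text, positions = build_haystack(facts, filler, total_sentences=50)
question = "Where is the apple?"
```

Raising `total_sentences` lengthens the context without changing the question or its supporting facts, which is the sense in which such a benchmark can be extended to any length.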

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2406.10149 [cs.CL] (or arXiv:2406.10149v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2406.10149

Submission history

From: Yuri Kuratov
[v1] Fri, 14 Jun 2024 16:00:29 UTC (7,834 KB)
