One Thousand and One Pairs: A "novel" challenge for long-context language models

Karpinska, Marzena; Thai, Katherine; Lo, Kyle; Goyal, Tanya; Iyyer, Mohit

arXiv:2406.16264 (cs)

[Submitted on 24 Jun 2024 (v1), last revised 22 Oct 2024 (this version, v3)]

Abstract: Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and the easy analysis of future models.

Comments:	EMNLP 2024, camera ready
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.16264 [cs.CL]
	(or arXiv:2406.16264v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.16264

Submission history

From: Marzena Karpinska
[v1] Mon, 24 Jun 2024 02:03:57 UTC (18,546 KB)
[v2] Thu, 18 Jul 2024 21:47:24 UTC (18,962 KB)
[v3] Tue, 22 Oct 2024 15:09:58 UTC (18,973 KB)
