LongAttn: Selecting Long-context Training Data via Token-level Attention

Wu, Longyun; Zhu, Dawei; Zhao, Guangxiang; Yu, Zhuocheng; Ran, Junfeng; Wong, Xiangyu; Sun, Lin; Li, Sujian

arXiv:2502.16860 (cs)

[Submitted on 24 Feb 2025 (v1), last revised 27 Feb 2025 (this version, v2)]

Abstract: As large language models (LLMs) develop, the need to handle long contexts well has grown. Constructing high-quality training data with long-range dependencies is crucial to enhancing long-context capabilities. Existing methods for selecting long-context data often rely on sentence-level analysis, which leaves considerable room for improvement in both performance and efficiency. In this paper, we propose LongAttn, a novel token-level framework that leverages the self-attention mechanism of LLMs to measure long-range dependencies in the data. By calculating the token-level dependency strength and the distribution uniformity of token scores, LongAttn quantifies long-range dependencies, enabling more accurate and efficient data selection. Using LongAttn, we filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). Comprehensive experiments demonstrate LongAttn's effectiveness, scalability, and efficiency. To facilitate future research on long-context data, we release our code and the high-quality long-context training dataset LongABC-32K.
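
The abstract names two quantities, a token-level dependency strength and the distribution uniformity of token scores, without giving their formulas. The sketch below is only one plausible reading and is not the authors' implementation: it assumes a single causal attention matrix whose rows sum to 1, treats attention mass placed on keys more than `window` positions back as the per-token score, and uses normalized entropy as the uniformity measure. The function names, the `window` parameter, and both formulas are hypothetical choices made for illustration.

```python
import numpy as np


def long_range_scores(attn: np.ndarray, window: int) -> np.ndarray:
    """Per-token long-range score (illustrative definition, not from the paper).

    attn is a (T, T) causal attention matrix whose rows sum to 1.
    The score of query token q is the attention mass it places on
    keys that lie more than `window` positions before it.
    """
    T = attn.shape[0]
    q = np.arange(T)[:, None]   # query positions as a column
    k = np.arange(T)[None, :]   # key positions as a row
    distant = (q - k) > window  # True where the key is far behind the query
    return (attn * distant).sum(axis=1)


def dependency_strength(token_scores: np.ndarray) -> float:
    """Assumed aggregate: the mean long-range score over all tokens."""
    return float(token_scores.mean())


def distribution_uniformity(token_scores: np.ndarray) -> float:
    """Assumed uniformity measure: normalized entropy of the scores, in [0, 1].

    Values near 1 mean the long-range mass is spread evenly across tokens;
    values near 0 mean it is concentrated in a few tokens (or absent).
    """
    total = token_scores.sum()
    if total <= 0 or token_scores.size < 2:
        return 0.0
    p = token_scores / total
    p = p[p > 0]  # drop zeros so log() is defined
    entropy = -(p * np.log(p)).sum()
    return float(entropy / np.log(token_scores.size))


# Toy input: uniform causal attention over 8 tokens.
T = 8
attn = np.tril(np.ones((T, T)))
attn /= attn.sum(axis=1, keepdims=True)

scores = long_range_scores(attn, window=2)
print(dependency_strength(scores), distribution_uniformity(scores))
```

Under these assumptions, a document whose tokens attend only to their immediate neighbours gets a strength of 0, while a document with long-range attention spread across many tokens scores high on both measures. A data filter could then threshold on the two values, though the thresholds and the choice of model, layer, and head are left open here.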

Comments:	17 pages, 5 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.16860 [cs.CL]
	(or arXiv:2502.16860v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.16860

Submission history

From: Longyun Wu
[v1] Mon, 24 Feb 2025 05:51:53 UTC (938 KB)
[v2] Thu, 27 Feb 2025 14:50:10 UTC (938 KB)
