Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

Zhu, Wenhao; Chen, Pinzhen; Hu, Hanxu; Huang, Shujian; Yuan, Fei; Chen, Jiajun; Birch, Alexandra

Computer Science > Computation and Language

arXiv:2502.15592 (cs)

[Submitted on 21 Feb 2025]

Abstract: Long-context modelling for large language models (LLMs) has been a key area of recent research because many real-world use cases require reasoning over longer inputs such as documents. Research on long-context modelling has focused on how to model position, and there has been little investigation into other important aspects of language modelling such as instruction tuning. Long-context training examples are challenging and expensive to create and use. In this paper, we investigate how to design instruction data for the post-training phase of a long-context pre-trained model: how much and what type of context is needed for optimal and efficient post-training. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones, while also identifying other critical factors such as instruction difficulty and context composition. Based on these findings, we propose context synthesis, a novel data synthesis framework that leverages off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experimental results on the document-level benchmark (LongBench) demonstrate that our proposed approach outperforms previous instruction synthesis approaches and comes close to the performance of human-annotated long-context instruction data. The project will be available at: this https URL.
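
The abstract describes context synthesis only at a high level: an off-the-shelf LLM writes an extended background context for an existing instruction-answer pair. A minimal sketch of that idea follows, as an illustration rather than the authors' implementation. The prompt wording, the `target_words` parameter, the `LongContextExample` type, and the `generate` callable (a stand-in for whatever LLM client is used) are all assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LongContextExample:
    """One synthesized training example: context + instruction -> answer."""
    context: str
    instruction: str
    answer: str


def build_prompt(instruction: str, answer: str, target_words: int) -> str:
    # Hypothetical prompt wording; the abstract does not give the real template.
    return (
        f"Write a background document of about {target_words} words "
        "that contains the information needed to answer the question below.\n"
        f"Question: {instruction}\n"
        f"Answer: {answer}\n"
        "Document:"
    )


def synthesize_example(
    instruction: str,
    answer: str,
    generate: Callable[[str], str],
    target_words: int = 2000,
) -> LongContextExample:
    # `generate` is any function mapping a prompt string to generated text.
    context = generate(build_prompt(instruction, answer, target_words))
    return LongContextExample(
        context=context.strip(), instruction=instruction, answer=answer
    )


def fake_llm(prompt: str) -> str:
    # Placeholder generator so the sketch runs without a model.
    return "  Placeholder background text.  "
```

Calling `synthesize_example("Who wrote X?", "Y", fake_llm)` returns an example whose `context` holds the generated text while the original instruction and answer are left unchanged, so an existing high-quality pair is reused as-is and only the context is new.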

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.15592 [cs.CL]
	(or arXiv:2502.15592v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.15592

Submission history

From: Wenhao Zhu
[v1] Fri, 21 Feb 2025 17:02:40 UTC (149 KB)
