Bootstrap Your Own Context Length

Wang, Liang; Yang, Nan; Zhang, Xingxing; Huang, Xiaolong; Wei, Furu

arXiv:2412.18860 (cs)

[Submitted on 25 Dec 2024]

Abstract: We introduce a bootstrapping approach to train long-context language models by exploiting their short-context capabilities only. Our method utilizes a simple agent workflow to synthesize diverse long-context instruction tuning data, thereby eliminating the necessity for manual data collection and annotation. The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection, all of which are readily accessible within the open-source ecosystem. Subsequently, language models are fine-tuned using the synthesized data to extend their context lengths. In this manner, we effectively transfer the short-context capabilities of language models to long-context scenarios through a bootstrapping process. We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length to up to 1M tokens, achieving superior performance across various benchmarks.
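
The abstract names three ingredients for the data synthesis workflow (a short-context language model, a text retriever, and a document collection) but does not spell out the individual steps. The sketch below shows one plausible way these pieces could be combined into a single long-context training example. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`tokenize`, `retrieve`, `synthesize_example`), the prompts, the word-overlap retriever, and the choice to pad the context with retrieved documents are all hypothetical.

```python
import re
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenizer used by the toy retriever."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query: str, docs: List[str], k: int) -> List[str]:
    """Toy retriever (assumption): rank documents by word overlap with the query."""
    query_words = set(tokenize(query))
    ranked = sorted(
        docs,
        key=lambda d: len(query_words & set(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def synthesize_example(
    seed_doc: str,
    docs: List[str],
    short_lm: Callable[[str], str],
    k: int = 3,
) -> Dict[str, str]:
    """Build one long-context example using only short-context LM calls.

    Hypothetical arrangement: every prompt sent to `short_lm` contains a
    single short document, while the returned context is much longer.
    """
    # 1. The short-context model writes an instruction from one short document.
    instruction = short_lm(f"Write a question about:\n{seed_doc}")

    # 2. The retriever gathers related documents to lengthen the context.
    candidates = retrieve(instruction, docs, k + 1)
    related = [d for d in candidates if d != seed_doc][:k]

    # 3. The short-context model answers while seeing only the seed document.
    response = short_lm(
        f"Context:\n{seed_doc}\nQuestion: {instruction}\nAnswer:"
    )

    # 4. The training example pairs the long context with the short-context answer.
    long_context = "\n\n".join(related + [seed_doc])
    return {
        "context": long_context,
        "instruction": instruction,
        "response": response,
    }
```

In practice `short_lm` would wrap a call to an actual language model and `retrieve` would be replaced by a real text retriever; any callable that maps a prompt string to a completion string fits the signature used here. Repeating `synthesize_example` over many seed documents with a larger `k` would yield progressively longer contexts for fine-tuning.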

Comments:	18 pages
Subjects:	Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:	arXiv:2412.18860 [cs.CL]
	(or arXiv:2412.18860v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.18860

Submission history

From: Liang Wang
[v1] Wed, 25 Dec 2024 10:08:54 UTC (184 KB)
