ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer

Yueyu, Lin; Zhiyuan, Li; Yue, Peter; Xiao, Liu


arXiv:2501.15570 (cs)

[Submitted on 26 Jan 2025]



Abstract: As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state tracking ability beyond transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is ongoing work that will be updated continuously. The model checkpoints and source code are available at this https URL and this https URL.
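The abstract does not spell out the distillation objective, so the following is only a generic illustration of logit-level knowledge distillation, not the authors' procedure. It assumes a temperature-softened KL divergence between teacher and student output distributions; the function names, the temperature value, and the toy logits are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: this is the textbook soft-label distillation loss,
    used here only as a stand-in for whatever objective the paper uses.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy next-token logits over a three-word vocabulary (made-up values).
teacher = [2.0, 0.5, -1.0]
matched = [2.0, 0.5, -1.0]      # student that agrees with the teacher
mismatched = [-1.0, 0.5, 2.0]   # student that ranks tokens in reverse

print(distillation_loss(teacher, matched))
print(distillation_loss(teacher, mismatched))
```

The loss vanishes when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal a student model (here, an RNN-style one) would be trained to minimize.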

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.15570 [cs.CL]
	(or arXiv:2501.15570v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.15570

Submission history

From: Xiao Liu [view email]
[v1] Sun, 26 Jan 2025 15:56:56 UTC (651 KB)
