Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA

Jin, Qingyun; Song, Xiaohui; Zhou, Feng; Qin, Zengchang

arXiv:2412.20677 (cs)

[Submitted on 30 Dec 2024]

Abstract: Large language models (LLMs) have been shown to perform well on a variety of natural language processing tasks. However, as model size and input sequence length increase, the rapid growth of the KV cache significantly slows down inference. Grouped-query attention (GQA) has therefore been widely adopted in LLMs as an alternative to multi-head attention (MHA). In this work, we propose a low-cost method for pruning MHA models into GQA models with any compression ratio of key-value heads. Our method uses $L_0$ masks to gradually remove redundant parameters. In addition, before pruning training we apply orthogonal transformations to the attention heads, which leave the model's output unchanged, to increase the similarity between heads and further improve the performance of the pruned model. Our method is compatible with rotary position embedding (RoPE), which means the trained model can be fully adapted to the mainstream standard GQA framework. Experiments demonstrate that our strategy can compress up to 87.5% of the key-value heads of the LLaMA2-7B model without too much performance degradation, using only supervised fine-tuning.
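
Two claims in the abstract can be made concrete with a small sketch. It is illustrative only and is not the authors' procedure. It assumes the publicly documented LLaMA2-7B shape (32 layers, 32 attention heads, head dimension 128) and a 2-byte (fp16) cache entry, none of which are stated in the abstract. The first part works out what removing 87.5% of the key-value heads means for the per-token KV cache. The second part checks that rotating a query and a key by the same orthogonal matrix leaves their attention score unchanged; the Householder reflection used here is an arbitrary stand-in for whatever transformation the paper computes, and RoPE is ignored. All function names are hypothetical.

```python
import random


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes cached per token: keys and values (factor 2) for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed LLaMA2-7B shape (public config values, not taken from the abstract).
n_layers, n_heads, head_dim = 32, 32, 128

# Removing 87.5% of the key-value heads keeps one eighth of them.
n_kv_heads_gqa = int(n_heads * (1 - 0.875))
mha_bytes = kv_cache_bytes_per_token(n_layers, n_heads, head_dim)
gqa_bytes = kv_cache_bytes_per_token(n_layers, n_kv_heads_gqa, head_dim)


def householder(v):
    """Return the orthogonal reflection matrix I - 2 v v^T / (v^T v)."""
    norm_sq = sum(a * a for a in v)
    n = len(v)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


def row_times_matrix(vec, mat):
    """Multiply a row vector by a matrix."""
    return [
        sum(vec[i] * mat[i][j] for i in range(len(vec)))
        for j in range(len(mat[0]))
    ]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
demo_dim = 8  # small head dimension, for illustration only
query = [random.gauss(0.0, 1.0) for _ in range(demo_dim)]
key = [random.gauss(0.0, 1.0) for _ in range(demo_dim)]
rotation = householder([random.gauss(0.0, 1.0) for _ in range(demo_dim)])

# (q R) . (k R) = q R R^T k^T = q . k, because R R^T = I for orthogonal R.
score_before = dot(query, key)
score_after = dot(row_times_matrix(query, rotation), row_times_matrix(key, rotation))

print(n_kv_heads_gqa, mha_bytes, gqa_bytes)
print(abs(score_before - score_after))
```

Under these assumptions the cache shrinks by a factor of eight, and the two scores agree up to floating-point rounding. That invariance is why heads can be re-expressed in a different basis before any pruning takes place.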

Comments:	12 pages, 4 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2412.20677 [cs.CL]
	(or arXiv:2412.20677v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.20677

Submission history

From: Qingyun Jin
[v1] Mon, 30 Dec 2024 03:05:45 UTC (539 KB)
