UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs

Xiong, Yizhe; Huang, Wei; Ye, Xin; Chen, Hui; Lin, Zijia; Lian, Haoran; Su, Zhenpeng; Han, Jungong; Ding, Guiguang

Computer Science > Computation and Language

arXiv:2502.00439 (cs)

[Submitted on 1 Feb 2025]



Abstract: Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, intra-layer KV sharing still results in high inference costs, while cross-layer KV sharing leads to significant performance degradation. As a result, both methods remain suboptimal for post-training pre-trained LLMs. In this paper, we identify that the Softmax operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax Unification in Attention (UniAttn), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training. Our code will be available at this https URL.
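
The idea of sharing Softmax activations across blocks can be illustrated with a minimal single-head sketch. This is not the authors' implementation: the abstract does not specify how blocks are grouped, where the linear projection is applied, or how it is trained, so the function names (`attention_weights`, `block_with_shared_weights`), the compensation matrix `w_c`, and the choice to add `x @ w_c` to the attention output are all assumptions made for illustration only.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(scores):
    """Numerically stable row-wise softmax."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_weights(x, w_q, w_k):
    """Causal Softmax attention weights for one head (computed once)."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [
        [
            sum(a * b for a, b in zip(qi, kj)) * scale if j <= i else float("-inf")
            for j, kj in enumerate(k)
        ]
        for i, qi in enumerate(q)
    ]
    return softmax_rows(scores)

def block_with_shared_weights(x, weights, w_v, w_c):
    """Hypothetical later block: reuse `weights` instead of recomputing
    Q, K, and Softmax, then add a linear compensation term x @ w_c
    (an assumed form of the error-compensating projection)."""
    attended = matmul(weights, matmul(x, w_v))
    compensation = matmul(x, w_c)
    return [[a + c for a, c in zip(ra, rc)] for ra, rc in zip(attended, compensation)]

# Toy input: 3 tokens with hidden size 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = attention_weights(x, identity, identity)           # computed in the first block
out = block_with_shared_weights(x, weights, identity, zeros)  # reused in a later block
```

In this sketch the later block skips the query and key projections and the Softmax entirely, so it needs neither its own attention scores nor its own key cache; only the value projection and the small compensation matrix remain. Whether this matches the cost profile reported in the paper is not something the abstract allows one to confirm.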

Comments:	11 pages, 4 figures. Preprint, under review
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.00439 [cs.CL]
	(or arXiv:2502.00439v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.00439

Submission history

From: Yizhe Xiong [view email]
[v1] Sat, 1 Feb 2025 14:16:31 UTC (612 KB)
