Saliency-driven Dynamic Token Pruning for Large Language Models

Tao, Yao; Tang, Yehui; Wang, Yun; Zhu, Mingjian; Hu, Hailin; Wang, Yunhe

arXiv:2504.04514 (cs)

[Submitted on 6 Apr 2025 (v1), last revised 9 Apr 2025 (this version, v2)]

Abstract: Despite the recent success of large language models (LLMs), long-sequence inference remains particularly challenging because the computational complexity of the attention mechanism is quadratic in sequence length. Inspired by the interpretability theory of feature attribution in neural network models, we observe that not all tokens contribute equally. Based on this observation, we propose a novel token pruning framework, Saliency-driven Dynamic Token Pruning (SDTP), which gradually and dynamically prunes redundant tokens based on the input context. Specifically, a lightweight saliency-driven prediction module is designed to estimate the importance score of each token from its hidden state; the module is added to different layers of the LLM to hierarchically prune redundant tokens. Furthermore, a ranking-based optimization strategy is proposed to minimize the ranking divergence between the saliency score and the predicted importance score. Extensive experiments show that our framework generalizes to various models and datasets. By hierarchically pruning 65% of the input tokens, our method reduces FLOPs by 33% to 47% and achieves a speedup of up to 1.75× during inference, while maintaining comparable performance. We further demonstrate that SDTP can be combined with KV cache compression methods for further compression.
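
As a rough illustration of the hierarchical pruning idea described in the abstract, the following minimal sketch scores each token from its hidden state and keeps only the top-scoring fraction at each stage. It is not the authors' implementation: the function names, the dot-product scorer (a stand-in for the learned prediction module), the per-stage `keep_ratio` of 0.5, and the toy data are all assumptions made for illustration. In a real model, transformer layers would run between pruning stages.

```python
import math


def score_tokens(hidden_states, weights):
    """Hypothetical importance predictor: one dot product per token.

    Stands in for a learned lightweight module; not the paper's architecture.
    """
    return [sum(h * w for h, w in zip(state, weights)) for state in hidden_states]


def prune_tokens(hidden_states, positions, scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving sequence order."""
    keep = max(1, math.ceil(len(hidden_states) * keep_ratio))
    # Rank token indices by score (highest first), then restore original order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    top.sort()
    return [hidden_states[i] for i in top], [positions[i] for i in top]


def hierarchical_prune(hidden_states, stage_weights, keep_ratio):
    """Apply one pruning stage per entry in stage_weights (assumed setup)."""
    positions = list(range(len(hidden_states)))
    for weights in stage_weights:
        # In an actual LLM, several transformer layers would update
        # hidden_states here before the next pruning stage.
        scores = score_tokens(hidden_states, weights)
        hidden_states, positions = prune_tokens(
            hidden_states, positions, scores, keep_ratio
        )
    return hidden_states, positions


# Toy input: six tokens with two-dimensional hidden states (made-up values).
hidden = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.1], [0.2, 0.7], [0.3, 0.3]]
stage_weights = [[1.0, 0.0], [0.0, 1.0]]  # one hypothetical scorer per stage

kept_states, kept_positions = hierarchical_prune(hidden, stage_weights, keep_ratio=0.5)
print(kept_positions)  # [1, 2]
```

Tracking the original positions alongside the surviving hidden states keeps it possible to map pruned sequences back to the input, which matters if positional information or a KV cache is indexed by token position.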

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2504.04514 [cs.CL] (or arXiv:2504.04514v2 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2504.04514

Submission history

From: Yao Tao
[v1] Sun, 6 Apr 2025 15:15:07 UTC (84 KB)
[v2] Wed, 9 Apr 2025 14:36:19 UTC (84 KB)
