Adaptive Computation Pruning for the Forgetting Transformer

Lin, Zhixuan; Obando-Ceron, Johan; He, Xu Owen; Courville, Aaron

arXiv:2504.06949 (cs)

[Submitted on 9 Apr 2025]

Abstract: The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on the local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. This is achieved using a dynamically set pruning threshold that ensures that the pruned attention weights remain negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 10% to 35% improvement in training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. We also perform several analyses to provide deeper insights into our method, such as examining the pruning patterns and analyzing the distribution of FLOP savings across different attention heads. Our code is available at this https URL.
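
The pruning idea described in the abstract can be illustrated with a small, unoptimized sketch. Everything below is an assumption made for illustration and is not taken from the authors' code: the function names (`pruning_threshold`, `forgetting_attention`, `demo`), the toy sizes, and in particular the threshold rule. The abstract does not state how the threshold is set, so the sketch uses one simple bound: if every query-key logit lies in [-U, U] with U = max ||q|| * max ||k||, then skipping entries whose accumulated log forget-gate bias falls below -2U - log L + log eps keeps the total skipped attention weight in each row at or below eps.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pruning_threshold(q, k, eps):
    """Hypothetical threshold rule (an assumption, not quoted from the paper).

    With |q_i . k_j| <= U, any entry whose decay bias is below the returned
    value has attention weight below eps / L, so each row loses at most eps.
    """
    max_q = max(math.sqrt(dot(v, v)) for v in q)
    max_k = max(math.sqrt(dot(v, v)) for v in k)
    u = max_q * max_k
    return -2.0 * u - math.log(len(q)) + math.log(eps)


def forgetting_attention(q, k, log_f, threshold=None):
    """Causal attention weights with a forget-gate bias added to the logits.

    The bias for (i, j) is the sum of log_f over positions j+1..i, which is
    zero on the diagonal and more negative further back. When `threshold` is
    given, entries whose bias is below it are skipped entirely.
    Returns (weights, number_of_logits_computed).
    """
    n = len(q)
    cum = [0.0]  # cum[t + 1] = log_f[0] + ... + log_f[t]
    for lf in log_f:
        cum.append(cum[-1] + lf)
    weights, computed = [], 0
    for i in range(n):
        logits = {}
        for j in range(i + 1):
            decay = cum[i + 1] - cum[j + 1]
            if threshold is not None and decay < threshold:
                continue  # pruned: this query-key product is never computed
            logits[j] = dot(q[i], k[j]) + decay
            computed += 1
        m = max(logits.values())  # the diagonal entry is always kept
        z = sum(math.exp(v - m) for v in logits.values())
        weights.append(
            [math.exp(logits[j] - m) / z if j in logits else 0.0
             for j in range(i + 1)]
        )
    return weights, computed


def demo(n=64, dim=4, forget=0.2, eps=1e-3, seed=0):
    """Toy head with a constant, fast-forgetting gate (illustrative values)."""
    rng = random.Random(seed)
    q = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]
    k = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]
    log_f = [math.log(forget)] * n
    delta = pruning_threshold(q, k, eps)
    full, n_full = forgetting_attention(q, k, log_f)
    pruned, n_pruned = forgetting_attention(q, k, log_f, delta)
    return full, pruned, n_full, n_pruned


if __name__ == "__main__":
    full, pruned, n_full, n_pruned = demo()
    print(f"logits computed: {n_pruned} of {n_full}")
```

This scalar loop only shows which query-key pairs can be skipped. A practical implementation would presumably skip whole blocks inside a fused attention kernel, which is outside the scope of this sketch.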

Comments:	Preprint. Under review
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2504.06949 [cs.LG]
	(or arXiv:2504.06949v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2504.06949

Submission history

From: Zhixuan Lin
[v1] Wed, 9 Apr 2025 14:57:55 UTC (706 KB)
