Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers

Bambhaniya, Abhimanyu Rajeshkumar; Yazdanbakhsh, Amir; Subramanian, Suvinay; Kao, Sheng-Chun; Agrawal, Shivani; Evci, Utku; Krishna, Tushar

arXiv:2402.04744 (cs)

[Submitted on 7 Feb 2024]

Abstract: N:M structured sparsity has garnered significant interest as a result of its relatively modest overhead and improved efficiency. This form of sparsity is also appealing for reducing the memory footprint, owing to its modest representation overhead. Although there have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (~50%), and the performance of models trained with these approaches tends to decline in high-sparsity regions (>80%). In this work, we study the effectiveness of existing sparse training recipes at high-sparsity regions and argue that these methods fail to sustain model quality on par with low-sparsity regions. We demonstrate that the significant factor contributing to this disparity is the presence of elevated levels of induced noise in the gradient magnitudes. To mitigate this undesirable effect, we employ decay mechanisms to progressively restrict the flow of gradients towards pruned elements. Our approach improves model quality by up to 2% and 5% in vision and language models, respectively, in the high-sparsity regime. We also evaluate the trade-off between model accuracy and training compute cost in terms of FLOPs. At iso-training FLOPs, our method yields better performance than conventional sparse training recipes, with an accuracy improvement of up to 2%. The source code is available at this https URL.
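
To make the two ideas in the abstract concrete, the following is a minimal, dependency-free sketch of (a) an N:M mask that keeps the N largest-magnitude weights in every group of M, and (b) a decay factor that progressively shrinks the gradient reaching pruned elements. It is an illustration under stated assumptions, not the authors' implementation: the function names `nm_mask`, `linear_decay`, and `gradient_scale`, the linear schedule, and the per-element scaling rule are all hypothetical choices made for this example.

```python
def nm_mask(weights, n, m):
    """Return a 0/1 mask keeping the n largest-magnitude entries in each group of m."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest entries by absolute value within this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        mask.extend(1.0 if i in keep else 0.0 for i in range(m))
    return mask


def linear_decay(step, total_steps):
    """Hypothetical schedule: 1.0 at step 0, falling linearly to 0.0 at total_steps."""
    return max(0.0, 1.0 - step / total_steps)


def gradient_scale(mask, beta):
    """Per-element gradient multiplier: 1.0 for kept weights, beta for pruned ones.

    With beta = 1.0 pruned weights receive full gradients (dense-like training);
    with beta = 0.0 no gradient flows to pruned weights.
    """
    return [mk + (1.0 - mk) * beta for mk in mask]


# Example: a 2:4 pattern (50% sparsity) over eight weights.
weights = [0.9, -0.1, 0.4, 0.05, -0.2, 0.7, 0.3, -0.8]
mask = nm_mask(weights, n=2, m=4)

# Gradient multipliers halfway through a hypothetical 100-step decay window.
beta = linear_decay(step=50, total_steps=100)
scales = gradient_scale(mask, beta)
print(mask)
print(scales)
```

Higher sparsity corresponds to a smaller N relative to M; for instance, `nm_mask(weights, n=1, m=8)` keeps one weight in eight (87.5% sparsity), which falls in the high-sparsity region the abstract refers to.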

Comments:	18 pages, 8 figures, 17 tables. Code is available at this https URL
Subjects:	Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Cite as:	arXiv:2402.04744 [cs.LG]
	(or arXiv:2402.04744v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.04744

Submission history

From: Amir Yazdanbakhsh
[v1] Wed, 7 Feb 2024 10:55:59 UTC (2,625 KB)
