To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions

Marshall, Noah; Xiao, Ke Liang; Agarwala, Atish; Paquette, Elliot

Statistics > Machine Learning

arXiv:2406.11733 (stat)

[Submitted on 17 Jun 2024 (v1), last revised 6 Oct 2024 (this version, v2)]

Abstract: The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical analysis of the learning dynamics in the limit of large intrinsic dimension, a model- and dataset-dependent notion of dimensionality. In this limit, we find a deterministic equation that describes the evolution of the loss and demonstrate that this equation predicts the path of clipped SGD on synthetic, CIFAR10, and Wikitext2 data. We show that with Gaussian noise, clipping cannot improve SGD performance. Yet, in other noisy settings, clipping can provide benefits with tuning of the clipping threshold. We propose a simple heuristic for near-optimal scheduling of the clipping threshold which requires the tuning of only one hyperparameter. We conclude with a discussion about the links between high-dimensional clipping and neural network training.
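
For readers who want a concrete picture of the setting the abstract describes, the following is a minimal sketch of clipped streaming SGD on a synthetic least squares problem. It is not the authors' code. The function names `clip` and `clipped_streaming_sgd`, the isotropic Gaussian data model, the unit-norm ground truth, and all default values are assumptions chosen for illustration; clipping is taken here to mean rescaling the gradient so that its Euclidean norm does not exceed a threshold.

```python
import math
import random


def clip(grad, threshold):
    """Rescale grad so that its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [scale * g for g in grad]


def clipped_streaming_sgd(dim=50, steps=2000, lr=0.5, threshold=1.0,
                          noise_std=0.05, seed=0):
    """One-pass clipped SGD on synthetic least squares (illustrative only).

    Returns the squared distance to the ground truth after every step,
    starting with the value at initialization.
    """
    rng = random.Random(seed)
    # Hypothetical ground truth with unit norm; the iterate starts at zero.
    theta_star = [1.0 / math.sqrt(dim)] * dim
    theta = [0.0] * dim
    history = [sum((t - s) ** 2 for t, s in zip(theta, theta_star))]
    for _ in range(steps):
        # Streaming: a fresh sample every step, x ~ N(0, I/dim),
        # with Gaussian label noise (an assumption of this sketch).
        x = [rng.gauss(0.0, 1.0) / math.sqrt(dim) for _ in range(dim)]
        y = sum(xi * si for xi, si in zip(x, theta_star))
        y += rng.gauss(0.0, noise_std)
        # Gradient of 0.5 * (<x, theta> - y)^2 with respect to theta.
        residual = sum(xi * ti for xi, ti in zip(x, theta)) - y
        grad = clip([residual * xi for xi in x], threshold)
        theta = [ti - lr * gi for ti, gi in zip(theta, grad)]
        history.append(sum((t - s) ** 2 for t, s in zip(theta, theta_star)))
    return history
```

Calling `clipped_streaming_sgd(threshold=...)` with several thresholds and comparing the returned curves is one simple way to explore how the clipping threshold affects the loss path under a chosen noise model.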

Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:2406.11733 [stat.ML]
	(or arXiv:2406.11733v2 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2406.11733

Submission history

From: Noah Marshall
[v1] Mon, 17 Jun 2024 16:50:22 UTC (197 KB)
[v2] Sun, 6 Oct 2024 18:51:39 UTC (470 KB)
