Activations and Gradients Compression for Model-Parallel Training

Rudakov, Mikhail; Beznosikov, Aleksandr; Kholodov, Yaroslav; Gasnikov, Alexander

doi:10.1134/S1064562423701314

arXiv:2401.07788 (cs)

[Submitted on 15 Jan 2024 (v1), last revised 26 Mar 2024 (this version, v2)]

Abstract: Training large neural networks requires enormous computational clusters of machines. Model-parallel training, in which the model architecture is partitioned sequentially across workers, is a popular approach for training modern models. Information compression can be applied to reduce inter-worker communication time, which is often a bottleneck in such systems. This work explores how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and also experiment with error compensation techniques. Moreover, we combine TopK with the AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression rates than activations. We observe that $K=10\%$ is the lowest TopK compression level that does not severely harm model convergence. Experiments also show that models trained with TopK perform well only when compression is also applied during inference. We find that error feedback techniques do not improve model-parallel training compared to plain compression, but they allow model inference without compression with almost no quality drop. Finally, when applied with the AQ-SGD approach, TopK compression stronger than $K=30\%$ worsens model performance significantly.
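
To make the terms in the abstract concrete, here is a minimal NumPy sketch of TopK compression with a classic error feedback buffer. It is an illustration only, not the authors' implementation: the names `top_k_compress` and `ErrorFeedback` are hypothetical, `k_fraction` is assumed to play the role of $K$ (the fraction of entries kept), and the error feedback shown is the generic residual-accumulation scheme rather than the AQ-SGD per-batch variant.

```python
import numpy as np


def top_k_compress(x: np.ndarray, k_fraction: float) -> np.ndarray:
    """Keep the k_fraction largest-magnitude entries of x and zero the rest."""
    flat = x.ravel()
    # Number of entries to keep, clamped to the range [1, flat.size].
    k = min(flat.size, max(1, int(round(k_fraction * flat.size))))
    # Indices of the k entries with the largest absolute value.
    keep = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(x.shape)


class ErrorFeedback:
    """Generic error feedback (an assumption, not AQ-SGD): what compression
    dropped in earlier steps is added back before the next compression."""

    def __init__(self) -> None:
        self.residual = None

    def compress(self, x: np.ndarray, k_fraction: float) -> np.ndarray:
        if self.residual is None:
            self.residual = np.zeros_like(x)
        corrected = x + self.residual      # re-inject previously dropped mass
        sent = top_k_compress(corrected, k_fraction)
        self.residual = corrected - sent   # remember what was dropped this time
        return sent
```

In a model-parallel pipeline, a function like `top_k_compress` would be applied to the activations sent forward between workers and to the gradients sent backward; per the abstract, gradients tolerate less aggressive compression than activations, so the two directions would likely use different values of `k_fraction`.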

Comments:	17 pages, 6 figures, 5 tables
Subjects:	Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
Cite as:	arXiv:2401.07788 [cs.LG]
	(or arXiv:2401.07788v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2401.07788
Related DOI:	https://doi.org/10.1134/S1064562423701314

Submission history

From: Aleksandr Beznosikov
[v1] Mon, 15 Jan 2024 15:54:54 UTC (3,859 KB)
[v2] Tue, 26 Mar 2024 16:49:44 UTC (3,859 KB)
