SparDL: Distributed Deep Learning Training with Efficient Sparse Communication

Zhao, Minjun; Yin, Yichen; Mao, Yuren; Liu, Qing; Chen, Lu; Gao, Yunjun

arXiv:2304.00737 (cs)

[Submitted on 3 Apr 2023 (v1), last revised 23 Feb 2024 (this version, v2)]

Abstract: Top-k sparsification has recently been widely used to reduce the communication volume in distributed deep learning. However, due to the Sparse Gradient Accumulation (SGA) dilemma, the performance of top-k sparsification still has limitations. Recently, a few methods have been put forward to handle the SGA dilemma. Regrettably, even the state-of-the-art method suffers from several drawbacks, e.g., it relies on an inefficient communication algorithm and requires extra transmission steps. Motivated by the limitations of existing methods, we propose a novel efficient sparse communication framework, called SparDL. Specifically, SparDL uses the Spar-Reduce-Scatter algorithm, which is based on an efficient Reduce-Scatter model, to handle the SGA dilemma without additional communication operations. Besides, to further reduce the latency cost and improve the efficiency of SparDL, we propose the Spar-All-Gather algorithm. Moreover, we propose the global residual collection algorithm to ensure fast convergence of model training. Finally, extensive experiments are conducted to validate the superiority of SparDL.
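The abstract does not define top-k sparsification or the SGA dilemma, so the following is only a minimal sketch under stated assumptions, not SparDL's algorithm. It assumes that top-k sparsification keeps the k largest-magnitude gradient entries on each worker and holds the rest back as a local residual, and that the SGA dilemma refers to the way summing such sparse gradients across workers yields a result with more than k non-zero entries. The function names (top_k_sparsify, sum_sparse, demo) and all parameter values are hypothetical and chosen for illustration.

```python
import random


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad.

    Returns (sparse, residual): sparse maps index -> value for the kept
    entries, residual is a dense list holding the entries that were dropped.
    """
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = set(order[:k])
    sparse = {i: grad[i] for i in order[:k]}
    residual = [0.0 if i in kept else g for i, g in enumerate(grad)]
    return sparse, residual


def sum_sparse(sparse_grads):
    """Sum sparse gradients (index -> value dicts) from several workers."""
    total = {}
    for sparse in sparse_grads:
        for i, v in sparse.items():
            total[i] = total.get(i, 0.0) + v
    return total


def demo(num_workers=4, dim=1000, k=10, seed=0):
    """Return the non-zero count after summing per-worker top-k gradients."""
    rng = random.Random(seed)
    sparse_grads = []
    for _ in range(num_workers):
        # Assumption: random Gaussian values stand in for real gradients.
        grad = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sparse, _residual = top_k_sparsify(grad, k)
        sparse_grads.append(sparse)
    return len(sum_sparse(sparse_grads))
```

Because each worker selects its own index set, the summed gradient returned by sum_sparse holds between k and num_workers * k non-zero entries, so the data exchanged can grow as more workers contribute unless the sum is sparsified again.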

Subjects:	Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2304.00737 [cs.LG]
	(or arXiv:2304.00737v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2304.00737

Submission history

From: Minjun Zhao
[v1] Mon, 3 Apr 2023 06:15:50 UTC (1,567 KB)
[v2] Fri, 23 Feb 2024 15:35:18 UTC (6,682 KB)
