Communication-efficient Decentralized Machine Learning over Heterogeneous Networks

Zhou, Pan; Lin, Qian; Loghin, Dumitrel; Ooi, Beng Chin; Wu, Yuncheng; Yu, Hongfang

arXiv:2009.05766 (cs)

[Submitted on 12 Sep 2020 (v1), last revised 20 Oct 2020 (this version, v2)]

Abstract: In recent years, distributed machine learning has commonly been executed over heterogeneous networks, such as a local area network within a multi-tenant cluster or a wide area network connecting data centers and edge clusters. In these networks, link speeds among worker nodes vary significantly, making it challenging for state-of-the-art machine learning approaches to train efficiently. Both centralized and decentralized training approaches suffer from low-speed links. In this paper, we propose a decentralized approach, NetMax, that enables worker nodes to communicate via high-speed links and thus significantly speeds up the training process. NetMax has the following novel features. First, it includes a novel consensus algorithm that allows worker nodes to train model copies on their local datasets asynchronously and to synchronize those copies by exchanging information via peer-to-peer communication, instead of relying on a central master node (i.e., a parameter server). Second, in each iteration, each worker node randomly selects one peer with a fine-tuned probability and exchanges information with it. In particular, peers with high-speed links are selected with high probability. Third, the peer-selection probabilities are designed to minimize the total convergence time. Moreover, we mathematically prove the convergence of NetMax. We evaluate NetMax on heterogeneous cluster networks and show that it achieves speedups of 3.7X, 3.4X, and 1.9X over the state-of-the-art decentralized training approaches Prague, Allreduce-SGD, and AD-PSGD, respectively.
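The peer-selection idea in the abstract can be illustrated with a minimal sketch. It is not the NetMax algorithm: the abstract says the selection probabilities are tuned to minimize total convergence time, whereas this sketch simply assumes probabilities proportional to link speed. The pairwise averaging step is likewise an assumption standing in for the paper's consensus update, and all function names and the example link speeds are hypothetical.

```python
import random


def selection_probabilities(link_speeds):
    """Assumed stand-in policy: probability proportional to link speed.

    NetMax derives its probabilities from an optimization over convergence
    time; that optimization is not reproduced here.
    """
    total = sum(link_speeds.values())
    if total <= 0:
        raise ValueError("at least one link speed must be positive")
    return {peer: speed / total for peer, speed in link_speeds.items()}


def pick_peer(probabilities, rng=random):
    """Randomly select one peer according to the given probabilities."""
    peers = list(probabilities)
    weights = [probabilities[peer] for peer in peers]
    return rng.choices(peers, weights=weights, k=1)[0]


def average_with_peer(local_model, peer_model):
    """Assumed synchronization step: element-wise average of two model copies."""
    return [(a + b) / 2.0 for a, b in zip(local_model, peer_model)]


# Hypothetical link speeds (e.g. in Gbit/s) from one worker to its peers.
speeds = {"worker_b": 10.0, "worker_c": 1.0, "worker_d": 1.0}
probs = selection_probabilities(speeds)
chosen = pick_peer(probs, random.Random(0))
```

Under this assumed policy, `worker_b` is chosen far more often than the peers behind slow links, which is the qualitative behavior the abstract describes.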

Comments:	17 pages, 19 figures, accepted at ICDE 2021
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2009.05766 [cs.DC]
	(or arXiv:2009.05766v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2009.05766

Submission history

From: Pan Zhou
[v1] Sat, 12 Sep 2020 11:17:55 UTC (22,877 KB)
[v2] Tue, 20 Oct 2020 13:02:06 UTC (22,948 KB)
