GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs

Yu, Menglu; Tian, Ye; Ji, Bo; Wu, Chuan; Rajan, Hridesh; Liu, Jia

arXiv:2202.01158 (cs)

[Submitted on 2 Feb 2022]

Abstract: Fueled by advances in distributed deep learning (DDL), recent years have witnessed a rapidly growing demand for resource-intensive distributed/parallel computing to process DDL computing jobs. To resolve network communication bottlenecks and load-balancing issues in distributed computing, the so-called "ring-all-reduce" decentralized architecture has been increasingly adopted to remove the need for dedicated parameter servers. To date, however, there remains a lack of theoretical understanding of how to design resource optimization algorithms for efficiently scheduling ring-all-reduce DDL jobs in computing clusters. This motivates us to fill the gap by proposing a series of new resource scheduling designs for ring-all-reduce DDL jobs. Our contributions in this paper are three-fold: i) We propose a new resource scheduling analytical model for ring-all-reduce deep learning, which covers a wide range of objectives in DDL performance optimization (e.g., excessive training avoidance, energy efficiency, fairness); ii) Based on the proposed performance analytical model, we develop an efficient resource scheduling algorithm called GADGET (greedy ring-all-reduce distributed graph embedding technique), which enjoys a provably strong performance guarantee; iii) We conduct extensive trace-driven experiments to demonstrate the effectiveness of the GADGET approach and its superiority over the state of the art.

Comments:	Accepted in Proc. IEEE INFOCOM, Virtual Event, May 2022
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2202.01158 [cs.DC]
	(or arXiv:2202.01158v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2202.01158

Submission history

From: Menglu Yu
[v1] Wed, 2 Feb 2022 17:35:16 UTC (2,523 KB)
