Distributed Training for Deep Learning Models On An Edge Computing Network Using Shielded Reinforcement Learning

Sen, Tanmoy; Shen, Haiying

arXiv:2206.00774 (cs)

[Submitted on 1 Jun 2022]

Abstract: Edge devices with local computation capability have made distributed deep learning (DL) training on edges possible. In such a method, the cluster head of a cluster of edges schedules DL training jobs from the edges. With this centralized scheduling, the cluster head knows the loads of all edges and can therefore avoid overloading them, but the head itself may become overloaded. To handle this problem, we propose a multi-agent reinforcement learning (MARL) system that enables each edge to schedule its own jobs using RL. However, without coordination among edges, action collisions may occur, in which multiple edges schedule tasks to the same edge and overload it. For this reason, we propose a system called Shielded ReinfOrcement learning (RL) based DL training on Edges (SROLE). In SROLE, a shield deployed in an edge checks for action collisions and provides alternative actions to avoid them. As a central shield for the entire cluster may become a bottleneck, we further propose a decentralized shielding method, in which different shields are responsible for different regions of the cluster and coordinate to avoid action collisions on the region boundaries. Our emulation and real-device experiments show that SROLE reduces training time by 59% compared to MARL and centralized RL.
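To make the shielding idea concrete, here is a minimal sketch of how a shield might detect an action collision and substitute an alternative action. The abstract does not give the rule SROLE actually uses, so everything below is an assumption for illustration: the function name `shield`, the integer load units, the per-edge capacity, and the greedy "most spare capacity" fallback are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical types: an agent proposes to place a job of a given demand on a target edge.
Proposal = Tuple[str, str, int]  # (agent_id, target_edge, job_demand)


def shield(proposals: List[Proposal],
           load: Dict[str, int],
           capacity: Dict[str, int]) -> Dict[str, str]:
    """Return a final edge per agent, overriding proposals that would overload an edge.

    Assumed rule (not taken from the paper): process proposals in order; if the
    target edge cannot absorb the job, redirect it to the edge with the most
    spare capacity that still fits the job.
    """
    projected = dict(load)          # loads after the actions accepted so far
    final: Dict[str, str] = {}
    for agent, target, demand in proposals:
        if projected[target] + demand <= capacity[target]:
            choice = target         # no collision: keep the agent's own action
        else:
            # Collision: look for alternative edges that can take the job.
            fits = [e for e in capacity if projected[e] + demand <= capacity[e]]
            if fits:
                choice = max(fits, key=lambda e: capacity[e] - projected[e])
            else:
                choice = target     # no safe alternative exists; leave unchanged
        projected[choice] += demand
        final[agent] = choice
    return final


# Two agents pick edge "e1" at once; together they would exceed its capacity.
load = {"e1": 2, "e2": 5, "e3": 1}
capacity = {"e1": 10, "e2": 10, "e3": 10}
proposals = [("a", "e1", 5), ("b", "e1", 5), ("c", "e2", 3)]
result = shield(proposals, load, capacity)
print(result)
```

In this example agent `a` keeps `e1`, agent `b` is redirected to `e3` (the edge with the most spare capacity), and agent `c` keeps `e2`. A decentralized variant, as described in the abstract, would run one such shield per region and additionally exchange projected loads for edges on region boundaries; that coordination is not sketched here.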

Comments:	Accepted in 2022 International Conference on Distributed Computing Systems (ICDCS)
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2206.00774 [cs.DC]
	(or arXiv:2206.00774v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2206.00774

Submission history

From: Tanmoy Sen
[v1] Wed, 1 Jun 2022 21:32:44 UTC (8,806 KB)
