Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules

Pan, Xinglin; Lin, Wenxiang; Shi, Shaohuai; Chu, Xiaowen; Sun, Weinong; Li, Bo


arXiv:2407.00599 (cs)

[Submitted on 30 Jun 2024 (v1), last revised 3 Jul 2024 (this version, v2)]



Abstract: Sparsely-activated Mixture-of-Experts (MoE) layers have found practical applications in enlarging the model size of large-scale foundation models, with only a sub-linear increase in computation demands. Despite the wide adoption of hybrid parallel paradigms like model parallelism, expert parallelism, and expert-sharding parallelism (i.e., MP+EP+ESP) to support MoE model training on GPU clusters, the training efficiency is hindered by communication costs introduced by these parallel paradigms. To address this limitation, we propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks. The proposed schedules eliminate redundant computations and communications and enable overlaps between intra-node and inter-node communications, ultimately reducing the overall training time. As the two schedules are not mutually exclusive, we provide comprehensive theoretical analyses and derive an automatic and accurate solution to determine which schedule should be applied in different scenarios. Experimental results on an 8-GPU server and a 32-GPU cluster demonstrate that Parm outperforms the state-of-the-art MoE training system, DeepSpeed-MoE, achieving 1.13$\times$ to 5.77$\times$ speedup on 1296 manually configured MoE layers and approximately 3$\times$ improvement on two real-world MoE models based on BERT and GPT-2.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2407.00599 [cs.DC]
	(or arXiv:2407.00599v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2407.00599

Submission history

From: Xinglin Pan
[v1] Sun, 30 Jun 2024 05:55:11 UTC (997 KB)
[v2] Wed, 3 Jul 2024 01:51:11 UTC (998 KB)
