Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing

Cong, Peizhuang; Yuan, Aomufei; Chen, Shimao; Tian, Yuxuan; Ye, Bowen; Yang, Tong

arXiv:2404.16914 (cs)

[Submitted on 25 Apr 2024]

Abstract: Mixture-of-Experts (MoE) facilitates the development of large models because the computational cost of the model no longer scales linearly with the number of parameters. The learned sparse gating network selects a set of experts for each token to be processed; however, the number of tokens processed by each expert can differ over successive iterations. These expert load fluctuations reduce computational parallelism and resource utilization. To address this, we traced and analyzed the load of each expert across training iterations for several large language models, and defined a transient state characterized by "obvious load fluctuation" and a stable state characterized by "temporal locality". Given the characteristics of these two states and the computational overhead, we deployed three classical prediction algorithms that achieve accurate expert load prediction. For the GPT3 350M model, the average error rates for predicting the expert load proportion over the next 1,000 and 2,000 steps are approximately 1.3% and 1.8%, respectively. This work can provide valuable guidance for expert placement and resource allocation in MoE model training. Building on it, we will propose an expert placement scheme for the transient and stable states in future work.
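
To make the notion of "expert load proportion" concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not name the three prediction algorithms or define the error rate, so the sliding-window mean and the mean absolute error used here are assumptions chosen for illustration, and the routing trace is made-up toy data. All function names are hypothetical.

```python
from collections import Counter

def load_proportions(expert_ids, num_experts):
    """Fraction of token-to-expert assignments handled by each expert in one step."""
    counts = Counter(expert_ids)
    total = len(expert_ids)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def sliding_window_forecast(history, window):
    """Assumed predictor: mean of the per-expert proportions over the last `window` steps."""
    recent = history[-window:]
    num_experts = len(recent[0])
    return [sum(step[e] for step in recent) / len(recent) for e in range(num_experts)]

def mean_abs_error(predicted, actual):
    """Assumed error measure: mean absolute difference between proportions."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical routing trace: the expert index chosen for each token in four steps.
steps = [
    [0, 0, 1, 2, 3, 0, 1, 2],
    [0, 1, 1, 2, 3, 0, 1, 2],
    [0, 0, 1, 2, 3, 3, 1, 2],
    [0, 0, 1, 2, 3, 0, 1, 2],
]

# Forecast the fourth step from the first three, then compare with what happened.
history = [load_proportions(s, num_experts=4) for s in steps[:3]]
predicted = sliding_window_forecast(history, window=3)
actual = load_proportions(steps[3], num_experts=4)
error = mean_abs_error(predicted, actual)
```

In the stable state described above, loads change slowly from step to step, which is the situation where a simple history-based forecast of this kind can be expected to track the true proportions closely.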

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2404.16914 [cs.LG]
	(or arXiv:2404.16914v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2404.16914

Submission history

From: Peizhuang Cong
[v1] Thu, 25 Apr 2024 15:39:59 UTC (3,545 KB)