COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search

Ibrahim, Shibal; Chen, Wenyu; Hazimeh, Hussein; Ponomareva, Natalia; Zhao, Zhe; Mazumder, Rahul

doi:10.1145/3580305.3599278

Abstract: The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity in various domains, such as natural language processing and vision. Sparse-MoEs select a subset of the "experts" (thus, only a portion of the overall network) for each input sample using a sparse, trainable gate. Existing sparse gates are prone to convergence and performance issues when training with first-order optimization methods. In this paper, we introduce two improvements to current MoE approaches. First, we propose a new sparse gate: COMET, which relies on a novel tree-based mechanism. COMET is differentiable, can exploit sparsity to speed up computation, and outperforms state-of-the-art gates. Second, due to the challenging combinatorial nature of sparse expert selection, first-order methods are typically prone to low-quality solutions. To deal with this challenge, we propose a novel, permutation-based local search method that can complement first-order methods in training any sparse gate, e.g., Hash routing, Top-k, DSelect-k, and COMET. We show that local search can help networks escape bad initializations or solutions. We performed large-scale experiments on various domains, including recommender systems, vision, and natural language processing. On standard vision and recommender systems benchmarks, COMET+ (COMET with local search) achieves up to 13% improvement in ROC AUC over popular gates, e.g., Hash routing and Top-k, and up to 9% over prior differentiable gates, e.g., DSelect-k. When Top-k and Hash gates are combined with local search, we see up to a $100\times$ reduction in the budget needed for hyperparameter tuning. Moreover, for language modeling, our approach improves over the state-of-the-art MoEBERT model for distilling BERT on 5/7 GLUE benchmarks as well as on the SQuAD dataset.
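To make the idea of a sparse gate concrete, the following is a minimal sketch of the Top-k gate that the abstract names as a baseline. It is not COMET's tree-based gate, whose construction is not described in the abstract. The function names `top_k_gate` and `sparse_moe`, the scalar toy experts, and the fixed gate logits are all assumptions made purely for illustration.

```python
import math


def top_k_gate(logits, k):
    """Sparse gate weights: softmax over the k largest logits, zero elsewhere."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (ties broken by lower index).
    top = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    # Numerically stable softmax restricted to the selected experts.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    weights = [0.0] * len(logits)
    for i in top:
        weights[i] = exps[i] / z
    return weights


def sparse_moe(x, experts, gate_logits_fn, k):
    """Weighted sum of the selected experts' outputs (hypothetical toy setup)."""
    weights = top_k_gate(gate_logits_fn(x), k)
    # Experts with zero weight are never evaluated, which is where the
    # computational savings of a sparse gate come from.
    return sum(w * experts[i](x) for i, w in enumerate(weights) if w > 0.0)


if __name__ == "__main__":
    # Toy scalar experts and a fixed (untrained) gate, for illustration only.
    experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
    gate_logits_fn = lambda x: [0.1, 2.0, -1.0, 2.0]
    print(top_k_gate(gate_logits_fn(3.0), k=2))
    print(sparse_moe(3.0, experts, gate_logits_fn, k=2))
```

In this sketch the hard selection of the k largest logits is the step that is not smooth in the gate parameters, which is the kind of difficulty for first-order training that the abstract attributes to existing sparse gates.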

Comments:	Accepted in KDD 2023
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2306.02824 [cs.LG]
	(or arXiv:2306.02824v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2306.02824
Related DOI:	https://doi.org/10.1145/3580305.3599278
