Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

Yang, Yujiao; Lian, Jing; Li, Linhui

arXiv:2503.02495 (cs)

[Submitted on 4 Mar 2025 (v1), last revised 6 Mar 2025 (this version, v2)]

Abstract: We propose Union-of-Experts (UoE), which decomposes a transformer into an equivalent group of experts and then applies selective routing to both input data and experts. Our approach advances MoE design with four key innovations: (1) We perform equivalent expert decomposition on both MLP blocks and attention blocks, based on the matrix partitioning used in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing at different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations, and optimize efficiency based on hardware processing analysis. Experiments demonstrate that the UoE model surpasses Full Attention, state-of-the-art MoEs, and efficient transformers (including the model architecture of the recently proposed DeepSeek-V3) on several tasks across the image and natural language domains. In language modeling tasks, we achieve an average reduction of 2.38 in perplexity compared to the best-performing MoE method, with an average of 76% of the FLOPs. On the Long Range Arena benchmark, we record an average score at least 0.68% higher than all comparison models, including Full Attention, MoEs, and transformer variants, with only 50% of the FLOPs of the best MoE method. In image classification, our model yields an average accuracy improvement of 1.75% over the best comparison model while maintaining comparable FLOPs. The source code is available at this https URL.
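
The equivalence claimed in innovation (1) can be illustrated with a small numerical check. The following is a minimal sketch, not the authors' implementation: it assumes a bias-free two-layer MLP with a ReLU activation, and the names `split_into_experts`, `union_forward`, and the `active` argument are hypothetical. It partitions the hidden units the way tensor parallelism does (rows of the first weight matrix, matching columns of the second) and confirms that summing all expert outputs reproduces the undivided MLP, because the activation acts element-wise.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def mlp(W1, W2, x):
    """Bias-free two-layer MLP: W2 @ relu(W1 @ x)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def split_into_experts(W1, W2, n_experts):
    """Partition hidden units into equal groups (hypothetical helper).

    Expert e keeps rows lo..hi of W1 and the matching columns of W2,
    mirroring the column/row split used in tensor parallelism.
    """
    hidden_dim = len(W1)
    assert hidden_dim % n_experts == 0, "hidden size must divide evenly"
    size = hidden_dim // n_experts
    experts = []
    for e in range(n_experts):
        lo, hi = e * size, (e + 1) * size
        experts.append((W1[lo:hi], [row[lo:hi] for row in W2]))
    return experts

def union_forward(experts, x, active=None):
    """Sum the outputs of the experts whose index is in `active`.

    `active=None` means all experts. The `active` set is an assumed
    stand-in for expert selection, not the paper's routing rule.
    """
    d_out = len(experts[0][1])
    out = [0.0] * d_out
    for idx, (W1_e, W2_e) in enumerate(experts):
        if active is not None and idx not in active:
            continue
        out = [o + y for o, y in zip(out, mlp(W1_e, W2_e, x))]
    return out

# Small random instance with a fixed seed for reproducibility.
rng = random.Random(0)
d_in, hidden_dim, d_out = 4, 8, 3
W1 = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(hidden_dim)]
W2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(d_out)]
x = [rng.uniform(-1, 1) for _ in range(d_in)]

experts = split_into_experts(W1, W2, n_experts=4)
full = mlp(W1, W2, x)
united = union_forward(experts, x)

# Largest element-wise difference; only floating-point rounding remains.
max_gap = max(abs(a - b) for a, b in zip(full, united))
print(max_gap)
```

Passing a subset such as `active={0, 2}` evaluates only those experts and gives an approximation of the full output. How that subset is chosen is the routing question the abstract describes, and it is not modeled in this sketch.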

Comments:	17 pages
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
MSC classes:	68T07
ACM classes:	I.5.1; I.2.0
Cite as:	arXiv:2503.02495 [cs.LG]
	(or arXiv:2503.02495v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2503.02495

Submission history

From: Yujiao Yang
[v1] Tue, 4 Mar 2025 11:01:25 UTC (3,761 KB)
[v2] Thu, 6 Mar 2025 08:51:47 UTC (3,748 KB)
