Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance

Liu, Shangyu; Zheng, Zhenzhe; Huang, Xiaoyao; Wu, Fan; Chen, Guihai; Wu, Jie

arXiv:2504.11197 (cs)

[Submitted on 15 Apr 2025 (v1), last revised 16 Apr 2025 (this version, v2)]

Abstract: Small language models (SLMs) support efficient deployment on resource-constrained edge devices, but their limited capacity compromises inference performance. Retrieval-augmented generation (RAG) is a promising way to enhance model performance by integrating external databases, without requiring intensive on-device model retraining. However, large-scale public databases and user-specific private contextual documents are typically located on the cloud and the device, respectively, while existing RAG implementations are primarily centralized. To bridge this gap, we propose DRAGON, a distributed RAG framework that enhances on-device SLMs with both general and personal knowledge without the risk of leaking document privacy. Specifically, DRAGON decomposes multi-document RAG into multiple parallel token generation processes performed independently and locally on the cloud and the device, and employs a newly designed Speculative Aggregation, a dual-side speculative algorithm, to avoid frequent output synchronization between the cloud and the device. A new scheduling algorithm is further introduced to identify the optimal aggregation side based on real-time network conditions. Evaluations on a real-world hardware testbed demonstrate a significant performance improvement of DRAGON: up to 1.9x greater gains over a standalone SLM compared to centralized RAG, a substantial reduction in per-token latency, and negligible Time to First Token (TTFT) overhead.
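The abstract describes splitting multi-document RAG into token generation processes that run separately on the cloud and the device, with their outputs aggregated on one side. As a rough illustration only, the sketch below shows one plausible form of output-level aggregation: a weighted mixture of the two sides' next-token distributions, followed by a greedy pick. It is not DRAGON's Speculative Aggregation (the speculative and scheduling parts are omitted entirely), and the function names, the toy vocabulary, and the relevance weights are all assumptions made for this example.

```python
from typing import Dict, List

# A next-token distribution over a toy vocabulary: token -> probability.
Distribution = Dict[str, float]


def mix_distributions(
    distributions: List[Distribution], weights: List[float]
) -> Distribution:
    """Return the weighted mixture of several next-token distributions.

    Weights are hypothetical relevance scores and need not be normalized.
    """
    if not distributions or len(distributions) != len(weights):
        raise ValueError("need one weight per distribution")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    mixed: Dict[str, float] = {}
    for dist, weight in zip(distributions, weights):
        for token, prob in dist.items():
            mixed[token] = mixed.get(token, 0.0) + (weight / total) * prob
    return mixed


def aggregate_step(
    cloud: Distribution,
    device: Distribution,
    cloud_weight: float,
    device_weight: float,
) -> str:
    """Pick the next token greedily from the cloud/device mixture."""
    mixed = mix_distributions([cloud, device], [cloud_weight, device_weight])
    return max(mixed, key=mixed.get)


# Assumed example: the cloud side conditions on public documents,
# the device side on private ones; the weights are made up.
cloud_dist = {"Paris": 0.6, "Lyon": 0.4}
device_dist = {"Paris": 0.2, "Lyon": 0.8}
next_token = aggregate_step(cloud_dist, device_dist, 1.0, 3.0)
```

In this toy setup only the per-token distributions cross the network, not the retrieved documents, which loosely mirrors the privacy motivation in the abstract. Exchanging a distribution at every token is exactly the synchronization cost the abstract says the speculative algorithm is designed to avoid.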

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Cite as:	arXiv:2504.11197 [cs.LG]
	(or arXiv:2504.11197v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2504.11197

Submission history

From: Shangyu Liu
[v1] Tue, 15 Apr 2025 13:53:08 UTC (448 KB)
[v2] Wed, 16 Apr 2025 03:32:23 UTC (476 KB)
