Distributed Speculative Inference of Large Language Models

Timor, Nadav; Mamou, Jonathan; Korat, Daniel; Berchansky, Moshe; Pereg, Oren; Wasserblat, Moshe; Galanti, Tomer; Gordon, Michal; Harel, David


arXiv:2405.14105 (cs)

[Submitted on 23 May 2024]

Abstract: Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution.
Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We show a gap: SI gets slower than non-SI when using slower or less accurate drafters. We close this gap by proving that DSI is faster than both SI and non-SI given any drafters. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI.
Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.
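
The gap described in the abstract (SI becoming slower than non-SI with a slow or inaccurate drafter) can be illustrated with the standard expected-speedup estimate for SI from the cited work [leviathan2023fast]. The sketch below is not the DSI algorithm and is not taken from this paper. It assumes each drafted token is accepted independently with a fixed probability, and the names `acceptance_rate`, `drafter_cost`, and `lookahead` are chosen here for illustration only.

```python
def si_expected_speedup(acceptance_rate: float, drafter_cost: float, lookahead: int) -> float:
    """Expected speedup of SI over non-SI under an i.i.d. acceptance model.

    Assumptions (illustrative, not from the DSI paper):
      acceptance_rate: probability in [0, 1) that the target accepts a drafted token.
      drafter_cost:    drafter forward-pass latency divided by target forward-pass latency.
      lookahead:       number of tokens drafted before each target verification.
    """
    if not 0.0 <= acceptance_rate < 1.0:
        raise ValueError("acceptance_rate must be in [0, 1)")
    if drafter_cost < 0.0 or lookahead < 0:
        raise ValueError("drafter_cost and lookahead must be non-negative")

    # Expected number of tokens produced per SI iteration (geometric series:
    # accepted drafts plus the one token the target always contributes).
    expected_tokens = (1.0 - acceptance_rate ** (lookahead + 1)) / (1.0 - acceptance_rate)

    # Cost of one SI iteration, in units of one target forward pass:
    # `lookahead` sequential drafter passes plus one target verification pass.
    iteration_cost = lookahead * drafter_cost + 1.0

    # Non-SI produces exactly one token per target forward pass (cost 1).
    return expected_tokens / iteration_cost


if __name__ == "__main__":
    # A fast, accurate drafter versus a slow, inaccurate one (made-up values).
    print(si_expected_speedup(acceptance_rate=0.8, drafter_cost=0.05, lookahead=5))
    print(si_expected_speedup(acceptance_rate=0.3, drafter_cost=0.5, lookahead=5))
```

Under this model, a value above 1 means SI is expected to beat non-SI and a value below 1 means SI is expected to be slower. The second call lands below 1, which matches the slowdown regime that the abstract says DSI is designed to avoid.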

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2405.14105 [cs.DC]
	(or arXiv:2405.14105v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2405.14105

Submission history

From: Nadav Timor
[v1] Thu, 23 May 2024 02:14:17 UTC (171 KB)
