ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Xiao, Zilin; Zhang, Hongming; Ge, Tao; Ouyang, Siru; Ordonez, Vicente; Yu, Dong

arXiv:2410.05589 (cs)

[Submitted on 8 Oct 2024]

Abstract: Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where a small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most existing works still draft tokens auto-regressively to maintain sequential dependency in language modeling, which we consider a huge computational burden in speculative decoding. We present ParallelSpec, an alternative to the auto-regressive drafting strategies used in state-of-the-art speculative decoding approaches. In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model. ParallelSpec learns to efficiently predict multiple future tokens in parallel using a single model and, at minimal training cost, it can be integrated into any speculative decoding framework that requires aligning the output distributions of the drafter and the target model. Experimental results show that ParallelSpec accelerates baseline methods by up to 62% in latency on text generation benchmarks from different domains, and it achieves a 2.84× overall speedup on the Llama-2-13B model using third-party evaluation criteria.
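The draft-then-verify loop described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: `toy_target_next`, `toy_parallel_draft`, and the token rules are hypothetical stand-ins chosen so the example runs without any model, and verification is greedy (exact match) rather than the distribution-based acceptance that real frameworks may use. The one point it tries to show is that the drafter proposes all k tokens in one call, each computed independently from the prefix, while the target checks them and supplies a correction at the first mismatch.

```python
from typing import List, Tuple

def toy_target_next(prefix: List[int]) -> int:
    """Hypothetical target model: greedy next token as a fixed rule."""
    if len(prefix) % 4 == 0:
        return 0
    return (prefix[-1] + 1) % 10

def toy_parallel_draft(prefix: List[int], k: int) -> List[int]:
    """Hypothetical parallel drafter: all k tokens depend only on the
    prefix and the offset i, never on earlier drafted tokens."""
    return [(prefix[-1] + 1 + i) % 10 for i in range(k)]

def speculative_step(prefix: List[int], k: int) -> List[int]:
    """One draft-then-verify round with greedy (exact-match) verification."""
    draft = toy_parallel_draft(prefix, k)
    accepted: List[int] = []
    # A real target model would score all k+1 positions in one forward
    # pass; the loop below only simulates that single pass.
    for i in range(k + 1):
        target_token = toy_target_next(prefix + draft[:i])
        accepted.append(target_token)
        if i == k or draft[i] != target_token:
            break  # bonus token after a full match, or first correction
    return accepted

def speculative_generate(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return the number of verify rounds.
    Assumes a non-empty prompt."""
    seq, rounds = list(prompt), 0
    while len(seq) < len(prompt) + n_new:
        seq.extend(speculative_step(seq, k))
        rounds += 1
    return seq[: len(prompt) + n_new], rounds

def greedy_generate(prompt: List[int], n_new: int) -> List[int]:
    """Plain auto-regressive decoding with the target, for comparison."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(toy_target_next(seq))
    return seq

if __name__ == "__main__":
    tokens, rounds = speculative_generate([1], 12, k=4)
    print(tokens, rounds)
```

Because every emitted token is either confirmed or produced by the target, the output matches plain greedy decoding with the target; the saving comes from needing fewer target rounds than generated tokens.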

Comments:	work in progress
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2410.05589 [cs.CL]
	(or arXiv:2410.05589v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.05589

Submission history

From: Zilin Xiao
[v1] Tue, 8 Oct 2024 01:05:08 UTC (2,858 KB)
