Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Ou, Jie; Chen, Yueming; Tian, Wenhong

arXiv:2404.08698 (cs)

[Submitted on 10 Apr 2024 (v1), last revised 10 Jul 2024 (this version, v2)]

Abstract: While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that employs an N-gram module, which adapts based on the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM's original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to enhance the precision of the initial draft, consequently reducing inference latency. ANPD eliminates the need for retraining or extra GPU memory, making it an efficient and plug-and-play enhancement. In our experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x, validating the effectiveness of our proposed ANPD.
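
The two-stage idea in the abstract (N-gram drafting, then verification by the original model) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `NGramDrafter`, `speculative_generate`, `greedy_baseline`, and `toy_model` are hypothetical, the "multi-level" module is assumed here to mean a simple back-off from longer to shorter N-gram prefixes, greedy decoding is assumed, and the verification loop calls the model once per draft token where a real LLM would check all draft positions in a single batched forward pass.

```python
from typing import Callable, Dict, List, Optional, Tuple

Token = int
# Stand-in for one greedy LLM step: maps a token sequence to the next token.
NextTokenFn = Callable[[List[Token]], Token]


class NGramDrafter:
    """Hypothetical N-gram table built only from the current context."""

    def __init__(self, max_n: int = 3) -> None:
        self.max_n = max_n
        # tables[k] maps a k-token prefix to the token that last followed it.
        self.tables: Dict[int, Dict[Tuple[Token, ...], Token]] = {
            k: {} for k in range(1, max_n)
        }

    def update(self, tokens: List[Token]) -> None:
        # Re-index the whole context; later occurrences overwrite earlier ones.
        for k, table in self.tables.items():
            for i in range(len(tokens) - k):
                table[tuple(tokens[i:i + k])] = tokens[i + k]

    def draft(self, tokens: List[Token], num_draft: int) -> List[Token]:
        ctx = list(tokens)
        out: List[Token] = []
        for _ in range(num_draft):
            nxt: Optional[Token] = None
            # Assumed "multi-level" lookup: try the longest prefix first.
            for k in range(self.max_n - 1, 0, -1):
                if len(ctx) >= k:
                    nxt = self.tables[k].get(tuple(ctx[-k:]))
                    if nxt is not None:
                        break
            if nxt is None:
                break  # no matching prefix, stop drafting early
            out.append(nxt)
            ctx.append(nxt)
        return out


def speculative_generate(
    next_token: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    num_draft: int = 4,
) -> Tuple[List[Token], int]:
    """Draft-and-verify loop; returns the tokens and the number of verify steps."""
    tokens = list(prompt)
    drafter = NGramDrafter()
    verify_steps = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        drafter.update(tokens)
        draft = drafter.draft(tokens, num_draft)
        verify_steps += 1  # one (conceptually batched) verification pass
        accepted: List[Token] = []
        # The trailing None never matches, so the model's own next token is
        # always appended after the last accepted draft token.
        candidates: List[Optional[Token]] = [*draft, None]
        for guess in candidates:
            target = next_token(tokens + accepted)
            accepted.append(target)  # only the model's choices are ever kept
            if target != guess:
                break
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], verify_steps


def greedy_baseline(
    next_token: NextTokenFn, prompt: List[Token], max_new_tokens: int
) -> List[Token]:
    """Plain autoregressive decoding: one model step per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def toy_model(tokens: List[Token]) -> Token:
    """Toy deterministic 'LLM' that cycles through 0..4."""
    return (tokens[-1] + 1) % 5
```

Because every kept token is the one the model itself would have chosen, the output of `speculative_generate` matches `greedy_baseline` for the same deterministic model; the draft only affects how many tokens are confirmed per verification step. With a repetitive context such as the one `toy_model` produces, several tokens are accepted per step, which is the source of the speedup in this kind of scheme.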

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2404.08698 [cs.CL]
	(or arXiv:2404.08698v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.08698

Submission history

From: Jie Ou [view email]
[v1] Wed, 10 Apr 2024 16:11:09 UTC (4,882 KB)
[v2] Wed, 10 Jul 2024 07:38:32 UTC (4,882 KB)