SubGen: Token Generation in Sublinear Time and Memory

Zandieh, Amir; Han, Insu; Mirrokni, Vahab; Karbasi, Amin

arXiv:2402.06082 (cs)

[Submitted on 8 Feb 2024]

Abstract: Despite the significant success of large language models (LLMs), their extensive memory requirements pose challenges for deploying them in long-context token generation. The substantial memory footprint of LLM decoders arises from the necessity to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, our focus is on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency within key embeddings in the attention module. Building on this key insight, we have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values. The result is a provably accurate and efficient attention decoding algorithm, termed SubGen. Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach. Empirical evaluations on long-context question-answering tasks demonstrate that SubGen significantly outperforms existing and state-of-the-art KV cache compression methods in terms of performance and efficiency.
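
To make the two ingredients named in the abstract concrete, here is a minimal sketch of streaming key clustering and streaming $\ell_2$ (squared-norm) sampling of values. It is not the SubGen algorithm itself: the abstract does not specify the clustering rule or the sampling scheme, so this sketch assumes greedy radius-based clustering and weighted reservoir sampling as stand-ins. All names (`OnlineKeyClusters`, `L2ValueSampler`, `radius`, `num_slots`) are hypothetical and chosen only for illustration.

```python
import numpy as np


class OnlineKeyClusters:
    """Greedy streaming clustering (an assumed stand-in, not the paper's rule).

    A key joins the nearest existing center if it lies within `radius`;
    otherwise it opens a new cluster with itself as the center.
    """

    def __init__(self, radius):
        self.radius = radius
        self.centers = []  # one representative key per cluster
        self.counts = []   # number of keys absorbed by each cluster

    def add(self, key):
        if self.centers:
            dists = np.linalg.norm(np.stack(self.centers) - key, axis=1)
            j = int(np.argmin(dists))
            if dists[j] <= self.radius:
                self.counts[j] += 1
                return j
        self.centers.append(np.array(key, dtype=float))
        self.counts.append(1)
        return len(self.centers) - 1


class L2ValueSampler:
    """Weighted reservoir sampling (an assumed stand-in for online l2 sampling).

    Each of the `num_slots` independent slots ends up holding value v_t with
    probability ||v_t||^2 / sum_s ||v_s||^2 over the stream seen so far.
    """

    def __init__(self, num_slots, seed=0):
        self.rng = np.random.default_rng(seed)
        self.slots = [None] * num_slots
        self.total = 0.0  # running sum of squared norms

    def add(self, value):
        weight = float(np.dot(value, value))
        if weight == 0.0:
            return  # a zero vector has zero sampling probability
        self.total += weight
        replace_prob = weight / self.total
        for i in range(len(self.slots)):
            if self.rng.random() < replace_prob:
                self.slots[i] = np.array(value, dtype=float)


# Synthetic stream: keys drawn from two tight, well-separated groups.
rng = np.random.default_rng(0)
keys = np.concatenate([rng.normal(0.0, 0.01, (50, 4)),
                       rng.normal(5.0, 0.01, (50, 4))])
values = rng.normal(size=(100, 4))

clusters = OnlineKeyClusters(radius=1.0)
sampler = L2ValueSampler(num_slots=8)
for k, v in zip(keys, values):
    clusters.add(k)
    sampler.add(v)

print(len(clusters.centers), "centers stored for", len(keys), "keys")
print(len(sampler.slots), "value samples stored for", len(values), "values")
```

In this toy setting the stored state depends on the number of clusters and sample slots rather than on the stream length, which is the intuition behind a sublinear cache when keys cluster well. How SubGen combines the two summaries into an attention estimate, and the guarantees it obtains, are given in the paper and are not reproduced here.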

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:2402.06082 [cs.LG]
	(or arXiv:2402.06082v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.06082

Submission history

From: Insu Han
[v1] Thu, 8 Feb 2024 22:17:40 UTC (373 KB)
