Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Tang, Jiaming; Zhao, Yilong; Zhu, Kan; Xiao, Guangxuan; Kasikci, Baris; Han, Song

arXiv:2406.10774 (cs)

[Submitted on 16 Jun 2024 (v1), last revised 26 Aug 2024 (this version, v2)]

Abstract: As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging because inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens dominates the attention outcomes. However, we observe that the criticality of a token depends heavily on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By loading only the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 7.03x self-attention speedup, which reduces inference latency by 2.23x, while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at this http URL.
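
The page-selection idea in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The abstract says only that per-page minimal and maximal Key values are combined with the Query to estimate criticality; the exact scoring rule used here (a per-channel upper bound on the query-key dot product) and the names `page_min_max`, `page_scores`, and `select_top_k_pages` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
PageStats = Tuple[List[float], List[float]]  # (per-channel min, per-channel max)


def page_min_max(keys: Sequence[Vector], page_size: int) -> List[PageStats]:
    """Split keys into fixed-size pages; record per-channel min and max per page."""
    stats = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        mins = [min(channel) for channel in zip(*page)]
        maxs = [max(channel) for channel in zip(*page)]
        stats.append((mins, maxs))
    return stats


def page_scores(query: Vector, stats: Sequence[PageStats]) -> List[float]:
    """Assumed scoring rule: an upper bound on q.k over all keys in a page.

    For each channel, the largest possible product is either q*min or q*max,
    depending on the sign of q, so summing the larger of the two bounds q.k.
    """
    return [
        sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))
        for mins, maxs in stats
    ]


def select_top_k_pages(query: Vector, keys: Sequence[Vector],
                       page_size: int, k: int) -> List[int]:
    """Return the indices of the k highest-scoring pages, in page order."""
    scores = page_scores(query, page_min_max(keys, page_size))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy cache: 6 keys with 2 channels each, grouped into 3 pages of 2 keys.
keys = [[1.0, 0.0], [0.5, 0.2],
        [-1.0, 0.0], [-0.5, -0.1],
        [0.0, 2.0], [0.1, 1.5]]

# Different queries select different pages from the same cache.
print(select_top_k_pages([1.0, 0.0], keys, page_size=2, k=2))
print(select_top_k_pages([-1.0, 0.0], keys, page_size=2, k=2))
```

In this toy setup the two queries pick different page sets from the same cache, which is the query dependence the abstract describes. In a real system the min/max statistics would be kept alongside the cache as it grows, rather than recomputed for every query, and attention would then be computed only over the selected pages.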

Comments:	ICML 2024
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2406.10774 [cs.CL]
	(or arXiv:2406.10774v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.10774

Submission history

From: Jiaming Tang
[v1] Sun, 16 Jun 2024 01:33:02 UTC (1,202 KB)
[v2] Mon, 26 Aug 2024 21:01:02 UTC (1,213 KB)
