Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

Godey, Nathan; Devoto, Alessio; Zhao, Yu; Scardapane, Simone; Minervini, Pasquale; de la Clergerie, Éric; Sagot, Benoît

arXiv:2503.02812 (cs)

[Submitted on 4 Mar 2025]

Abstract: Autoregressive language models rely on a Key-Value (KV) Cache, which speeds up generation by avoiding the re-computation of past hidden states. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Unlike many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV in retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves 99% accuracy in the needle-in-a-haystack task at a 32x compression level and reduces the generation perplexity drop by up to 65% compared to Streaming-LLM.
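
The filtering step described in the abstract can be pictured with a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `projection_scores` and `compress_kv` are hypothetical, the fixed `direction` vector is assumed to be given (the abstract does not say how the projection is obtained), and the convention that a larger projection means a more important key is also an assumption. The sketch only shows why such a scheme needs no attention weights: each cached key is scored on its own against one fixed direction.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def projection_scores(keys: Sequence[Vector], direction: Vector) -> List[float]:
    """Dot product of each cached key with one fixed, context-agnostic direction."""
    return [sum(k_i * u_i for k_i, u_i in zip(key, direction)) for key in keys]


def compress_kv(
    keys: Sequence[Vector],
    values: Sequence,
    direction: Vector,
    compression_ratio: int,
) -> Tuple[list, list, List[int]]:
    """Keep roughly len(keys) / compression_ratio KV pairs.

    Assumption (not stated in the abstract): keys with a larger projection
    onto `direction` are treated as more important and are kept.
    """
    budget = max(1, len(keys) // compression_ratio)
    scores = projection_scores(keys, direction)
    # Pick the `budget` highest-scoring positions, then restore sequence order.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:budget])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache: 4 two-dimensional keys, compressed by a factor of 2.
toy_keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.5], [-1.0, 2.0]]
toy_values = ["a", "b", "c", "d"]
_, kept_values, kept_positions = compress_kv(toy_keys, toy_values, [1.0, 0.0], 2)
print(kept_positions, kept_values)
```

With a compression ratio of 32, the same call would keep about one in every 32 cached pairs. The scoring cost is one dot product per key and does not depend on the queries seen at generation time.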

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.02812 [cs.CL]
	(or arXiv:2503.02812v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.02812

Submission history

From: Nathan Godey
[v1] Tue, 4 Mar 2025 17:37:49 UTC (7,461 KB)
