ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Liu, Xiang; Tang, Zhenheng; Dong, Peijie; Li, Zeyu; Li, Bo; Hu, Xuming; Chu, Xiaowen

arXiv:2502.00299 (cs)

[Submitted on 1 Feb 2025]

Abstract: To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that previous KV cache compression methods measure token importance individually, neglecting the dependencies between tokens that characterize real-world language. In light of this, we introduce ChunkKV, which groups the tokens in a chunk as a basic compression unit, retaining the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluate ChunkKV on cutting-edge long-context benchmarks, including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmarks. In our experiments with instruction-tuned and multi-step reasoning (O1 and R1) LLMs, ChunkKV achieves up to a 10% performance improvement under aggressive compression ratios compared to existing methods.
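The two ideas in the abstract, chunk-level selection and layer-wise index reuse, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how chunk importance is computed or which layers share indices, so the sketch assumes that each token already has an importance score (for example, one derived from attention), that a chunk's score is the sum of its token scores, and that indices are recomputed every `reuse_depth` layers. The names `select_chunks`, `select_with_reuse`, `chunk_size`, `keep_tokens`, and `reuse_depth` are hypothetical.

```python
from typing import List, Sequence


def select_chunks(scores: Sequence[float], chunk_size: int, keep_tokens: int) -> List[int]:
    """Return the sorted token indices kept when whole chunks are retained.

    Assumption: a chunk's importance is the sum of its token scores.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    n = len(scores)
    # Split token positions into contiguous chunks; the last one may be shorter.
    chunks = [range(start, min(start + chunk_size, n)) for start in range(0, n, chunk_size)]
    chunk_scores = [sum(scores[i] for i in chunk) for chunk in chunks]
    # Keep as many whole chunks as fit in the token budget (at least one).
    n_keep = max(1, keep_tokens // chunk_size)
    ranked = sorted(range(len(chunks)), key=lambda c: chunk_scores[c], reverse=True)
    kept_chunks = sorted(ranked[:n_keep])  # restore original token order
    return [i for c in kept_chunks for i in chunks[c]]


def select_with_reuse(
    layer_scores: Sequence[Sequence[float]],
    chunk_size: int,
    keep_tokens: int,
    reuse_depth: int,
) -> List[List[int]]:
    """Compute kept indices once per group of `reuse_depth` layers and reuse them.

    Assumption: layers are grouped consecutively; the first layer of each
    group selects the indices and the remaining layers in the group copy them.
    """
    if reuse_depth <= 0:
        raise ValueError("reuse_depth must be positive")
    kept_per_layer: List[List[int]] = []
    current: List[int] = []
    for layer, scores in enumerate(layer_scores):
        if layer % reuse_depth == 0:
            current = select_chunks(scores, chunk_size, keep_tokens)
        kept_per_layer.append(list(current))
    return kept_per_layer


if __name__ == "__main__":
    token_scores = [0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.6]
    print(select_chunks(token_scores, chunk_size=2, keep_tokens=4))
```

With these toy scores, the two highest-scoring chunks cover positions 2 to 3 and 6 to 7, so neighbouring tokens are kept or dropped together instead of being ranked one by one. In `select_with_reuse`, a layer that reuses indices skips its own ranking step, which is where the reduction in computational overhead would come from under these assumptions.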

Comments:	35 pages
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.00299 [cs.CL]
	(or arXiv:2502.00299v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.00299

Submission history

From: Xiang Liu
[v1] Sat, 1 Feb 2025 03:49:47 UTC (713 KB)
