Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference

Tao, Wei; Zhang, Bin; Qu, Xiaoyang; Wan, Jiguang; Wang, Jianzong

arXiv:2503.23294 (cs)

[Submitted on 30 Mar 2025]

Abstract: Large language models (LLMs) can now handle increasingly long contexts. However, an overly long context can cause intolerable inference latency and GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache at token granularity, which makes the search process time-consuming and the computation hardware-inefficient. This paper introduces Cocktail, an approach that uses chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search quickly determines the bitwidth configuration of the KV cache chunks from the similarity scores between the corresponding context chunks and the query, which maintains model accuracy. Chunk-level KV cache computation reorders the KV cache chunks before quantization, which avoids the hardware inefficiency that mixed-precision quantization otherwise causes during inference. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.
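
The two modules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity over precomputed chunk embeddings, the two thresholds, and the candidate bitwidths (16, 4, 2) are all assumptions chosen for illustration. The first function assigns a bitwidth to each chunk from its similarity to the query; the second computes a permutation that places chunks of equal bitwidth next to each other, so each precision group could be processed as one contiguous block.

```python
import numpy as np

def assign_bitwidths(chunk_embs, query_emb, hi_thresh=0.6, lo_thresh=0.3,
                     bitwidths=(16, 4, 2)):
    """Hypothetical chunk-level search: pick a bitwidth per chunk from its
    cosine similarity to the query. Thresholds and bitwidths are assumptions."""
    chunk_embs = np.asarray(chunk_embs, dtype=np.float64)
    query_emb = np.asarray(query_emb, dtype=np.float64)
    # Cosine similarity between every chunk embedding and the query embedding.
    norms = np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb)
    sims = (chunk_embs @ query_emb) / (norms + 1e-12)
    # More relevant chunks keep higher precision.
    bits = np.where(sims >= hi_thresh, bitwidths[0],
                    np.where(sims >= lo_thresh, bitwidths[1], bitwidths[2]))
    return sims, bits

def reorder_by_bitwidth(bits):
    """Hypothetical reordering step: a stable permutation that groups chunks
    with the same bitwidth contiguously, highest precision first."""
    return np.argsort(-np.asarray(bits), kind="stable")

if __name__ == "__main__":
    # Toy 2-D embeddings for four context chunks and one query.
    chunks = [[1, 0], [0, 1], [1, 1], [1, 0.1]]
    query = [1, 0]
    sims, bits = assign_bitwidths(chunks, query, hi_thresh=0.8, lo_thresh=0.4)
    order = reorder_by_bitwidth(bits)
    print(bits.tolist())   # [16, 2, 4, 16]
    print(order.tolist())  # [0, 3, 2, 1]
```

In this toy run, chunks 0 and 3 stay at the highest precision and end up adjacent after reordering, followed by the 4-bit chunk and then the 2-bit chunk. Whether reordering preserves attention outputs in a real model depends on details the abstract does not give, such as positional information already being encoded in the cached keys.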

Comments:	Accepted by the Design, Automation, and Test in Europe 2025 (DATE 2025)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2503.23294 [cs.CL]
	(or arXiv:2503.23294v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.23294

Submission history

From: Jianzong Wang
[v1] Sun, 30 Mar 2025 03:20:34 UTC (2,081 KB)
