ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

Saxena, Utkarsh; Sharify, Sayeh; Roy, Kaushik; Wang, Xin

arXiv:2412.14363 (cs)

[Submitted on 18 Dec 2024]

Abstract: Post-training quantization (PTQ) of large language models (LLMs) promises to reduce the prohibitive computational cost of inference. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging because extreme outliers in activations cause high quantization error. To tackle this problem, we propose ResQ, a PTQ method that advances the state of the art. Using principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama families of models, we demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next-best method, SpinQuant, and a 2.4x speedup over the 16-bit baseline. Code is available at this https URL.
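
The idea summarized in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it only simulates quantization ("fake quantization") of a single activation matrix, and every function name here (`quantize`, `random_orthogonal`, `pca_basis`, `mixed_precision_quantize`) is hypothetical. The symmetric per-token quantizer, the uncentered covariance, and the use of a QR decomposition of a Gaussian matrix to draw random rotations are all assumptions made for illustration.

```python
import numpy as np


def quantize(x, bits):
    """Symmetric per-row uniform quantization; returns dequantized values.

    Assumption: one scale per row (token), chosen from the row's max magnitude.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale


def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def pca_basis(acts):
    """Orthonormal basis whose columns are sorted by descending variance.

    Simplification: uses the uncentered second-moment matrix of `acts`.
    """
    cov = acts.T @ acts / acts.shape[0]
    _, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    return eigvecs[:, ::-1]


def mixed_precision_quantize(acts, basis, rank, rng, hi_bits=8, lo_bits=4):
    """Keep the top-`rank` PCA coefficients at `hi_bits`, the rest at `lo_bits`.

    A random rotation is applied inside each subspace before quantization.
    """
    d = acts.shape[1]
    u_hi = basis[:, :rank] @ random_orthogonal(rank, rng)
    u_lo = basis[:, rank:] @ random_orthogonal(d - rank, rng)
    c_hi = quantize(acts @ u_hi, hi_bits)  # high-variance coefficients
    c_lo = quantize(acts @ u_lo, lo_bits)  # remaining coefficients
    # Project back to the original space to measure reconstruction error.
    return c_hi @ u_hi.T + c_lo @ u_lo.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = 64
    acts = rng.standard_normal((256, hidden))
    acts[:, :4] *= 50.0  # synthetic outlier channels

    basis = pca_basis(acts)
    mixed = mixed_precision_quantize(acts, basis, rank=hidden // 8, rng=rng)
    naive = quantize(acts, 4)  # plain 4-bit baseline, no rotation

    print("mixed-precision MSE:", np.mean((acts - mixed) ** 2))
    print("naive 4-bit MSE:    ", np.mean((acts - naive) ** 2))
```

On this synthetic input, the high-variance subspace absorbs the outlier channels, so the 4-bit quantizer only sees the well-behaved remainder. In the sketch the rotations are applied and undone explicitly; it makes no claim about how the paper arranges this in an actual model or about the reported speedup.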

Comments:	14 pages, 6 figures, 6 tables
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2412.14363 [cs.LG]
	(or arXiv:2412.14363v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2412.14363

Submission history

From: Utkarsh Saxena
[v1] Wed, 18 Dec 2024 22:01:55 UTC (5,340 KB)
