Computer Science > Machine Learning
[Submitted on 4 Jul 2024 (v1), revised 19 Aug 2024 (this version, v2), latest version 6 Sep 2024 (v4)]
Title: HERA: High-efficiency Matrix Compression via Element Replacement
Abstract: Matrix quantization encodes matrix elements in a more space-efficient form to reduce storage requirements; dequantization reconstructs an approximation of the original matrix for practical use. We define the Quantization Error Minimization (QEM) problem as minimizing the difference between a matrix before and after quantization while holding the memory occupied by the quantized matrix fixed. Matrix quantization is essential in various fields, including weight quantization in Large Language Models (LLMs), vector databases, KV cache quantization, graph compression, and image compression. The growing scale of LLMs, such as GPT-4 and BERT, underscores the need for matrix compression, because their parameters and KV caches are large and are stored as matrices.
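One way to write the QEM problem described above as a constrained optimization is sketched here. The symbols are assumptions introduced for illustration and do not come from the abstract: X is the original matrix, Q a quantizer, D the matching dequantizer, and B the memory budget. The squared Frobenius norm is one plausible choice of "difference", chosen because it is proportional to the MSE reported later in the abstract.

```latex
\min_{Q,\,D}\; \lVert X - D(Q(X)) \rVert_F^{2}
\quad \text{subject to} \quad \operatorname{size}\bigl(Q(X)\bigr) \le B
```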
To address the QEM problem, we introduce HERA, an algorithm that leverages the local orderliness of matrix elements by iteratively swapping elements to create a locally ordered matrix. This matrix is then grouped and quantized by columns. To further improve HERA, we present two optimizations: additional quantization of residuals to reduce mean squared error (MSE), and masking and batch processing to accelerate the algorithm.
Our experiments show that HERA reduces MSE to 12.3% of its original value at the same compression ratio, outperforming leading baseline algorithms. Our contributions include formalizing the QEM problem, developing the HERA algorithm, and proposing two optimizations that improve accuracy and processing speed, respectively.
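The column-grouped quantization and the residual pass mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it omits the element-swapping step entirely and swaps in plain uniform min-max quantization per group of columns. The function names, the 4-bit width, and the group size are all hypothetical choices made for the example.

```python
import numpy as np

def quantize_groups(x, bits=4, group_cols=4):
    """Uniform min-max quantization applied independently to each group of
    columns, returning the dequantized (reconstructed) matrix.
    Illustrative stand-in only; not the method from the paper."""
    levels = 2 ** bits - 1
    out = np.empty_like(x, dtype=np.float64)
    for start in range(0, x.shape[1], group_cols):
        block = x[:, start:start + group_cols]
        lo, hi = block.min(), block.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = np.round((block - lo) / scale)  # integer codes in [0, levels]
        out[:, start:start + group_cols] = codes * scale + lo  # dequantize
    return out

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))  # synthetic data, assumed for the demo

# First pass: quantize the matrix itself.
first_pass = quantize_groups(x)

# Second pass: quantize what the first pass missed and add it back.
residual = x - first_pass
second_pass = first_pass + quantize_groups(residual)

mse_one = mse(x, first_pass)
mse_two = mse(x, second_pass)
print(f"MSE after one pass:      {mse_one:.6f}")
print(f"MSE after residual pass: {mse_two:.6f}")
```

Note that the residual pass stores a second set of codes, so it spends extra memory; a comparison at equal compression ratio would have to account for those bits.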
Submission history
From: Yanshu Wang
[v1] Thu, 4 Jul 2024 05:13:58 UTC (2,860 KB)
[v2] Mon, 19 Aug 2024 03:18:59 UTC (3,103 KB)
[v3] Wed, 21 Aug 2024 02:32:43 UTC (3,105 KB)
[v4] Fri, 6 Sep 2024 08:28:01 UTC (498 KB)