BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference

Jang, Wonsuk; Tambe, Thierry

arXiv:2501.01144 (cs)

[Submitted on 2 Jan 2025 (v1), last revised 3 Jan 2025 (this version, v2)]

Abstract: Large Language Models (LLMs) have achieved remarkable success, but their increasing size poses significant challenges in memory usage and computational cost. Quantizing both weights and activations can address these issues, with fine-grained block-wise quantization emerging as a promising hardware-supported solution for mitigating outliers. However, existing methods struggle to capture nuanced block data distributions. To address this, we propose BlockDialect, a block-wise fine-grained mixed format technique that assigns each block its optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures hardware efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves an 11.83% (7.56%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to the MXFP4 format while using fewer bits per data element, and it is only 5.46% (2.65%) below full precision even when quantizing full-path matrix multiplication. By focusing on how to represent rather than how to scale, our work presents a promising path for energy-efficient LLM inference.
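The idea of assigning each block a format from a formatbook can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the `FORMATBOOK` entries other than the standard FP4 E2M1 magnitudes are hypothetical and are not the DialectFP4 formatbook, the per-block scale is a plain real number rather than a hardware-friendly shared exponent, and the selection rule is a brute-force minimum squared error search rather than the two-stage approach described in the abstract. Levels are stored as integers in units of 0.5, in the spirit of the "scaled integers" mentioned above.

```python
from math import inf
from typing import Dict, List, Tuple

# Magnitude levels per dialect, stored as integers in units of 0.5.
# "e2m1" is the standard FP4 E2M1 set (0, 0.5, 1, 1.5, 2, 3, 4, 6) times 2.
# "uniform" and "top_heavy" are HYPOTHETICAL dialects made up for this sketch.
FORMATBOOK: Dict[str, List[int]] = {
    "e2m1": [0, 1, 2, 3, 4, 6, 8, 12],
    "uniform": [0, 1, 2, 3, 4, 5, 6, 7],
    "top_heavy": [0, 2, 4, 6, 8, 10, 11, 12],
}


def quantize_block(block: List[float], levels: List[int]) -> Tuple[List[float], float]:
    """Quantize one block to the given levels; return (dequantized values, squared error)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block), 0.0
    # Assumption: map the block maximum onto the largest level with a real-valued scale.
    scale = max_abs / levels[-1]
    out: List[float] = []
    err = 0.0
    for x in block:
        # Nearest representable magnitude, then restore the sign.
        mag = min(levels, key=lambda q: abs(abs(x) - q * scale))
        y = mag * scale if x >= 0 else -(mag * scale)
        out.append(y)
        err += (x - y) ** 2
    return out, err


def select_dialect(
    block: List[float], formatbook: Dict[str, List[int]] = FORMATBOOK
) -> Tuple[str, List[float], float]:
    """Pick the dialect with the lowest squared error for this block (brute force)."""
    best_name, best_out, best_err = "", [], inf
    for name, levels in formatbook.items():
        out, err = quantize_block(block, levels)
        if err < best_err:
            best_name, best_out, best_err = name, out, err
    return best_name, best_out, best_err


if __name__ == "__main__":
    evenly_spaced = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    fp4_shaped = [0.0, -0.5, 1.0, 1.5, -2.0, 3.0, 4.0, 6.0]
    print(select_dialect(evenly_spaced)[0])
    print(select_dialect(fp4_shaped)[0])
```

In this sketch an evenly spaced block is matched exactly by the hypothetical `uniform` dialect, while a block whose magnitudes follow the E2M1 spacing is matched exactly by `e2m1`, which shows why a single fixed format can be a poor fit for some blocks.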

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2501.01144 [cs.CL]
(or arXiv:2501.01144v2 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2501.01144

Submission history

From: Wonsuk Jang
[v1] Thu, 2 Jan 2025 08:57:00 UTC (1,735 KB)
[v2] Fri, 3 Jan 2025 09:27:46 UTC (1,188 KB)