Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models

Lo, Yun-Chen; Wei, Gu-Yeon; Brooks, David

Computer Science > Hardware Architecture

arXiv:2412.19821 (cs)

[Submitted on 15 Dec 2024]

Title:Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models

Authors:Yun-Chen Lo, Gu-Yeon Wei, David Brooks

View PDF HTML (experimental)

Abstract:As cutting-edge large language models (LLMs) continue to transform various industries, their fast-growing model size and sequence length have led to memory traffic and capacity challenges. Recently, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm have proposed a Microscaling standard (Mx), which augments block floating-point with microexponents to achieve promising perplexity-to-footprint trade-offs. However, the Microscaling suffers from significant perplexity degradation on modern LLMs with less than six bits. This paper profiles modern LLMs and identifies three main challenges of low-bit Microscaling format, i.e., inaccurate tracking of outliers, vacant quantization levels, and wasted binary code. In response, Nanoscaling (NxFP) proposes three techniques, i.e., NanoMantissa, Adaptive Microexponent, and Code Recycling to enable better accuracy and smaller memory footprint than state-of-the-art MxFP. Experimental results on direct-cast inference across various modern LLMs demonstrate that our proposed methods outperform state-of-the-art MxFP by up to 0.64 in perplexity and by up to 30% in accuracy on MMLU benchmarks. Furthermore, NxFP reduces memory footprint by up to 16% while achieving comparable perplexity as MxFP.

Comments:	12 pages, 12 figures
Subjects:	Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
ACM classes:	I.2.7; E.4
Cite as:	arXiv:2412.19821 [cs.AR]
	(or arXiv:2412.19821v1 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2412.19821

Submission history

From: Yun-Chen Lo [view email]
[v1] Sun, 15 Dec 2024 22:18:20 UTC (1,345 KB)

Computer Science > Hardware Architecture

Title:Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Hardware Architecture

Title:Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators