Neural Network Compression using Binarization and Few Full-Precision Weights

Nardini, Franco Maria; Rulli, Cosimo; Trani, Salvatore; Venturini, Rossano

Computer Science > Computer Vision and Pattern Recognition

arXiv:2306.08960 (cs)

[Submitted on 15 Jun 2023 (v1), last revised 15 Sep 2023 (this version, v2)]

Abstract: Quantization and pruning are two effective model compression methods for Deep Neural Networks. In this paper, we propose Automatic Prune Binarization (APB), a novel compression technique combining quantization with pruning. APB enhances the representational capability of binary networks using a few full-precision weights. Our technique jointly maximizes the accuracy of the network while minimizing its memory impact by deciding whether each weight should be binarized or kept in full precision. We show how to efficiently perform a forward pass through layers compressed using APB by decomposing it into a binary and a sparse-dense matrix multiplication. Moreover, we design two novel efficient algorithms for extremely quantized matrix multiplication on CPU, leveraging highly efficient bitwise operations. The proposed algorithms are 6.9x and 1.5x faster than available state-of-the-art solutions. We extensively evaluate APB on two widely adopted model compression datasets, namely CIFAR10 and ImageNet. APB delivers a better accuracy/memory trade-off than state-of-the-art methods based on i) quantization, ii) pruning, and iii) a combination of pruning and quantization. APB also outperforms quantization in the accuracy/efficiency trade-off, being up to 2x faster than the 2-bit quantized model with no loss in accuracy.
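The decomposition mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that binarized weights take the value +alpha or -alpha for a single scalar `alpha`, and that a boolean mask `keep` marks the few full-precision weights. The helper names (`decompose`, `forward`, `pack`, `binary_dot`) are hypothetical. Under those assumptions the layer output is alpha * (signs @ x) plus a sparse correction, and a dot product between two +1/-1 vectors reduces to an XOR followed by a population count, which is the standard bitwise identity rather than either of the paper's two CPU algorithms.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Plain dense matrix-vector product, used as the reference."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def decompose(w: Matrix, keep: List[List[bool]], alpha: float):
    """Split w into a sign matrix and a sparse full-precision correction.

    Assumption (not from the paper): binarized weights equal +alpha or
    -alpha, and keep[i][j] marks weights stored in full precision.
    """
    signs = [[1 if v >= 0 else -1 for v in row] for row in w]
    # Sparse part as {(row, col): correction}. It stores w - alpha * sign,
    # so alpha * signs + sparse reproduces each kept weight exactly.
    sparse: Dict[Tuple[int, int], float] = {
        (i, j): w[i][j] - alpha * signs[i][j]
        for i, row in enumerate(keep)
        for j, kept in enumerate(row)
        if kept
    }
    return signs, sparse


def forward(signs, sparse, alpha: float, x: List[float]) -> List[float]:
    """Compute alpha * (signs @ x) + sparse @ x."""
    out = [alpha * sum(s * xj for s, xj in zip(row, x)) for row in signs]
    for (i, j), c in sparse.items():
        out[i] += c * x[j]
    return out


def pack(signs_row: List[int]) -> int:
    """Pack a +1/-1 vector into an integer; bit k is set when entry k is +1."""
    return sum(1 << k for k, s in enumerate(signs_row) if s == 1)


def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two +1/-1 vectors of length n from their packed bits.

    Positions where the bits differ contribute -1 and the rest +1, so the
    result is n - 2 * popcount(a XOR b).
    """
    return n - 2 * bin(a_bits ^ b_bits).count("1")
```

The sparse part is kept as a dictionary only for readability; a real kernel would more plausibly use a compressed sparse format and word-level popcount instructions. The example also assumes real-valued inputs in `forward`, whereas `binary_dot` applies only when both operands are already binarized.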

Comments:	15 pages, 6 figures, 3 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
ACM classes:	I.2.6
Cite as:	arXiv:2306.08960 [cs.CV]
	(or arXiv:2306.08960v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.08960

Submission history

From: Franco Maria Nardini
[v1] Thu, 15 Jun 2023 08:52:00 UTC (7,062 KB)
[v2] Fri, 15 Sep 2023 12:13:30 UTC (6,050 KB)
