Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference

Hawks, Benjamin; Duarte, Javier; Fraser, Nicholas J.; Pappalardo, Alessandro; Tran, Nhan; Umuroglu, Yaman

doi:10.3389/frai.2021.676564

arXiv:2102.11289 (cs)

[Submitted on 22 Feb 2021 (v1), last revised 19 Jul 2021 (this version, v2)]

Abstract: Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning (removing insignificant synapses) and quantization (reducing the precision of the calculations). In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra-low-latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically matches or exceeds the computational efficiency of other neural architecture search techniques such as Bayesian optimization. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability.
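As a rough illustration of the two techniques the abstract names, the sketch below applies magnitude pruning and then symmetric uniform quantization to a flat list of weights. It is a minimal sketch under stated assumptions, not the authors' method: the function names `magnitude_prune` and `quantize_uniform`, the choice of magnitude as the pruning criterion, the symmetric rounding grid, and the sample weights are all hypothetical, and the code operates on fixed weights after the fact rather than during quantization-aware training.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    Assumption: magnitude is used as the measure of "insignificance".
    """
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending absolute value; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def quantize_uniform(weights, bits):
    """Round weights onto a symmetric signed grid representable in `bits` bits.

    Assumption: one shared scale, chosen so the largest magnitude maps to the
    largest positive integer level, 2**(bits - 1) - 1.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    # Pruned weights stay exactly zero because zero lies on the grid.
    return [round(w / scale) * scale for w in weights]


# Hypothetical weights, for illustration only.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
sparse = magnitude_prune(weights, sparsity=0.5)   # half of the weights removed
quantized = quantize_uniform(sparse, bits=3)      # survivors on a 3-bit grid
print(sparse)
print(quantized)
```

Pruning first and quantizing second keeps the removed weights at exactly zero, since zero is always a point on a symmetric grid. The work described above studies the combination during training, which this post-hoc sketch does not attempt to reproduce.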

Comments:	22 pages, 7 Figures, 1 Table
Subjects:	Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Detectors (physics.ins-det)
Report number:	FERMILAB-PUB-21-056-SCD
Cite as:	arXiv:2102.11289 [cs.LG]
	(or arXiv:2102.11289v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2102.11289
Journal reference:	Front. AI 4, 94 (2021)
Related DOI:	https://doi.org/10.3389/frai.2021.676564

Submission history

From: Nhan Tran
[v1] Mon, 22 Feb 2021 19:00:05 UTC (5,275 KB)
[v2] Mon, 19 Jul 2021 22:11:41 UTC (5,252 KB)
