Packrat: Automatic Reconfiguration for Latency Minimization in CPU-based DNN Serving

Bhardwaj, Ankit; Phanishayee, Amar; Narayanan, Deepak; Tarta, Mihail; Stutsman, Ryan

arXiv:2311.18174 (cs)

[Submitted on 30 Nov 2023]

Abstract: In this paper, we investigate how to push the performance limits of serving Deep Neural Network (DNN) models on CPU-based servers. Specifically, we observe that while intra-operator parallelism across multiple threads is an effective way to reduce inference latency, it provides diminishing returns. Our primary insight is that, instead of running a single instance of a model with all available threads on a server, running multiple instances, each with a smaller batch size and fewer threads for intra-op parallelism, can provide lower inference latency. However, the right configuration is hard to determine manually because it is both workload-dependent (the DNN model and the batch size used by the serving system) and deployment-dependent (the number of CPU cores on the server). We present Packrat, a new serving system for online inference that, given a model and a batch size ($B$), algorithmically picks the optimal number of instances ($i$), the number of threads each should be allocated ($t$), and the batch size each should operate on ($b$) so as to minimize latency. Packrat is built as an extension to TorchServe and supports online reconfiguration to avoid serving downtime. Averaged across a range of batch sizes, Packrat improves inference latency by 1.43$\times$ to 1.83$\times$ on a range of commonly used DNNs.
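
To make the configuration problem concrete, the following is a minimal sketch of choosing $i$, $t$, and $b$ for a given $B$ and thread count. It is not Packrat's algorithm, which the abstract does not describe. It assumes every instance receives the same number of threads and an even share of the batch, and it relies on a hypothetical latency model (`toy_latency`, with a made-up serial fraction and made-up costs) in place of measured latencies. The names `Config`, `toy_latency`, and `pick_config` are illustrative only.

```python
import math
from typing import Callable, NamedTuple, Optional


class Config(NamedTuple):
    instances: int   # i: number of model instances
    threads: int     # t: intra-op threads per instance
    batch: int       # b: largest batch any single instance handles
    latency: float   # estimated latency for the whole batch B (ms)


def toy_latency(threads: int, batch: int) -> float:
    """Hypothetical latency model in ms (an assumption, not measured data).

    Amdahl-style scaling: 10% of the per-item work is serial, so adding
    threads gives diminishing returns. A fixed per-call overhead is added.
    """
    serial, per_item_ms, overhead_ms = 0.1, 8.0, 2.0
    return overhead_ms + batch * per_item_ms * (serial + (1 - serial) / threads)


def pick_config(total_batch: int, total_threads: int,
                latency: Callable[[int, int], float]) -> Config:
    """Exhaustively try uniform splits and keep the lowest-latency one."""
    best: Optional[Config] = None
    # Each instance needs at least one thread and at least one item.
    for i in range(1, min(total_batch, total_threads) + 1):
        t = total_threads // i            # threads per instance
        b = math.ceil(total_batch / i)    # largest per-instance batch
        # Instances run in parallel, so the slowest one sets the latency.
        est = latency(t, b)
        if best is None or est < best.latency:
            best = Config(i, t, b, est)
    assert best is not None
    return best


if __name__ == "__main__":
    single = toy_latency(16, 8)                 # one instance, all threads
    best = pick_config(8, 16, toy_latency)
    print(f"single instance: {single:.2f} ms")
    print(f"best uniform split: {best}")
```

Under this toy model, splitting a batch of 8 across several small instances on 16 threads yields a lower estimated latency than one 16-thread instance, which mirrors the abstract's primary insight. A real system would replace `toy_latency` with profiled measurements and might also consider non-uniform splits.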

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2311.18174 [cs.DC]
	(or arXiv:2311.18174v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2311.18174

Submission history

From: Ankit Bhardwaj
[v1] Thu, 30 Nov 2023 01:36:46 UTC (1,813 KB)
