FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping

Yu, Minchen; Wang, Ao; Chen, Dong; Yu, Haoxuan; Luo, Xiaonan; Li, Zhuohao; Wang, Wei; Chen, Ruichuan; Nie, Dapeng; Yang, Haoran


arXiv:2306.03622v1 (cs)

[Submitted on 6 Jun 2023 (this version), latest version 8 Feb 2024 (v2)]

Abstract: The dynamic request patterns of machine learning (ML) inference workloads have driven an increasing trend towards exploiting serverless computing for scalable ML model serving. However, today's serverless platforms lack efficient support for GPUs: provisioning functions on GPUs incurs extremely high overhead, forcing functions to stay alive even when idle in order to reduce cold starts. This wastes significant resources during ML inference and hinders pay-per-use billing for GPUs.
In this paper, we present FaaSwap, a serverless platform enabling fine-grained, request-level GPU sharing for resource-efficient ML inference. FaaSwap leverages model swapping to support fast inference execution at low resource cost. It keeps models in host memory, which is large and cheap, and quickly swaps them to GPUs when requested, reducing the per-function keep-alive cost and enabling efficient GPU sharing across many more functions. FaaSwap also supports swapping models between GPUs for load balancing and improved inference performance. In FaaSwap, we design request scheduling and memory management algorithms that exploit model swapping to reduce GPU cost and meet latency service-level objectives (SLOs) for all inference functions. We have implemented FaaSwap and integrated it into Alibaba Cloud Function Compute (FC), one of the world's largest commercial serverless platforms. Evaluation results show that FaaSwap can achieve low-latency model swapping, efficiently share a GPU across hundreds of functions, and satisfy per-function latency SLOs at scale.
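
The swapping idea in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not FaaSwap's scheduler or memory manager, which the abstract does not specify. It assumes a single GPU, a least-recently-used (LRU) eviction rule chosen purely for illustration, and hypothetical model names and sizes. Every model stays registered in host memory, and a request swaps its model onto the GPU only if it is not already resident.

```python
from collections import OrderedDict


class SwapCache:
    """Toy model of one GPU whose models are swapped in from host memory.

    Assumption: LRU eviction is used for illustration only; it is not
    claimed to be the policy FaaSwap implements.
    """

    def __init__(self, gpu_capacity_mb):
        self.gpu_capacity_mb = gpu_capacity_mb
        self.host = {}            # model name -> size in MB, always kept on the host
        self.gpu = OrderedDict()  # models resident on the GPU, least recently used first
        self.swap_ins = 0         # number of host-to-GPU swaps performed

    def register(self, name, size_mb):
        """Keep a model in host memory so it can be swapped in on demand."""
        if size_mb > self.gpu_capacity_mb:
            raise ValueError(f"{name} does not fit on the GPU")
        self.host[name] = size_mb

    def used_mb(self):
        return sum(self.gpu.values())

    def serve(self, name):
        """Serve one request; return True if the model was already resident."""
        size = self.host[name]
        if name in self.gpu:
            self.gpu.move_to_end(name)  # mark as most recently used
            return True
        # Evict least recently used models until the requested one fits.
        while self.used_mb() + size > self.gpu_capacity_mb:
            self.gpu.popitem(last=False)
        self.gpu[name] = size
        self.swap_ins += 1
        return False


# Hypothetical models and sizes, chosen only to exercise the eviction path.
cache = SwapCache(gpu_capacity_mb=1000)
for model_name, model_size in [("resnet", 400), ("bert", 500), ("vit", 300)]:
    cache.register(model_name, model_size)

hits = [cache.serve(m) for m in ["resnet", "bert", "resnet", "vit", "bert"]]
print(hits, cache.swap_ins, list(cache.gpu))
```

In this toy run only the repeated `resnet` request finds its model resident; the other four requests each trigger a swap from host memory. A real system would also have to account for swap latency against each function's SLO, which this sketch ignores.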

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2306.03622 [cs.DC]
	(or arXiv:2306.03622v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2306.03622

Submission history

From: Minchen Yu
[v1] Tue, 6 Jun 2023 12:19:05 UTC (500 KB)
[v2] Thu, 8 Feb 2024 12:34:16 UTC (357 KB)
