Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

Ghosh, Himel

arXiv:2411.15664 (cs)

[Submitted on 23 Nov 2024]

Title: Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

Authors: Himel Ghosh

Abstract: This review report discusses cold start latency in serverless inference and existing solutions to it. It focuses on ServerlessLLM, a system designed to address the cold start problem in serverless inference for large language models (LLMs). Traditional serverless approaches suffer from high latency because LLM checkpoints are large and initializing GPU resources is costly. ServerlessLLM introduces a multi-tier checkpoint loading system that leverages underutilized GPU memory and storage to reduce startup times by 6-8x compared to existing methods. It also proposes live inference migration and a startup-time-optimized model scheduler to allocate resources efficiently and minimize delays. The system significantly improves performance and scalability for LLM workloads in serverless environments. Besides ServerlessLLM, several other methods from the recent research literature, including RainbowCake, are reviewed. Further discussion covers how FaaS providers tackle cold starts and possible directions for future work.
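
The idea of a startup-time-optimized scheduler over multi-tier checkpoint storage can be illustrated with a minimal sketch. This is not the ServerlessLLM implementation: the tier names, the bandwidth figures, the `Server` class, and the cost model (queueing delay plus checkpoint size divided by tier bandwidth) are all hypothetical assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical read bandwidths per storage tier, in GB/s (illustrative only).
TIER_BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


@dataclass
class Server:
    name: str
    cached: dict  # model name -> fastest local tier holding its checkpoint
    queue_delay_s: float = 0.0  # assumed wait until a GPU becomes free


def estimate_startup_s(server, model, size_gb):
    """Estimate startup time as queueing delay plus checkpoint load time."""
    # A checkpoint that is not cached locally must come from remote storage.
    tier = server.cached.get(model, "remote")
    return server.queue_delay_s + size_gb / TIER_BANDWIDTH_GBPS[tier]


def pick_server(servers, model, size_gb):
    """Choose the server with the lowest estimated startup time."""
    return min(servers, key=lambda s: estimate_startup_s(s, model, size_gb))


servers = [
    Server("a", {"llm-x": "ssd"}),                      # 20 / 5 = 4 s
    Server("b", {"llm-x": "dram"}, queue_delay_s=4.0),  # 4 + 20 / 20 = 5 s
    Server("c", {}),                                    # 20 / 1 = 20 s
]
best = pick_server(servers, "llm-x", 20.0)
print(best.name)  # → a
```

In this toy setup the server holding the checkpoint in DRAM is not chosen, because its queueing delay outweighs the faster load; the point is only that locality and availability are weighed together in one estimate.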

Comments:	12 pages, 7 figures, TUM Cloud Computing Seminar
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2411.15664 [cs.DC]
	(or arXiv:2411.15664v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2411.15664

Submission history

From: Himel Ghosh
[v1] Sat, 23 Nov 2024 22:19:37 UTC (724 KB)
