KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management

Cheng, Rongxin; Peng, Yifan; Lai, Yuxin; Wei, Xingda; Chen, Rong; Chen, Haibo


arXiv:2412.18169 (cs)

[Submitted on 24 Dec 2024 (v1), last revised 26 Dec 2024 (this version, v2)]



Abstract: The stateful nature of large language model (LLM) serving can easily throttle precious GPU memory under load bursts or long-generation requests such as chain-of-thought reasoning, causing latency spikes due to queuing of incoming requests. However, state-of-the-art KVCache-centric approaches handle load spikes by dropping, migrating, or swapping KVCache, which faces an essential tradeoff between the performance of ongoing vs. incoming requests and thus still severely violates SLO. This paper makes the key observation that model parameters are independent of the requests and are replicated across GPUs, and thus proposes a parameter-centric approach that selectively drops replicated parameters to free precious memory for requests. However, LLM serving requires KVCache to be stored together with the corresponding model parameters, and thus dropping parameters can cause either huge computation waste or long network delay, affecting all ongoing requests. Based on the observation that attention operators can be decoupled from other operators, this paper further proposes a novel remote attention mechanism through pipeline parallelism so as to serve upcoming requests with the additional memory borrowed from parameters on remote GPUs. This paper further addresses several other challenges, including live exchange of KVCache with incomplete parameters, generating an appropriate plan that balances memory requirements with cooperative execution overhead, and seamlessly restoring parameters once the throttling has passed. Evaluations show that KunServe reduces the tail TTFT of requests under throttling by up to 27.3x compared to the state-of-the-art.
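
The memory argument in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the paper's implementation: the `Gpu` class, the contiguous layer split, and all sizes (80 GB GPUs, 32 layers of 2 GB each) are hypothetical values chosen only to show how dropping replicated parameters within a group of GPUs enlarges the memory left for KVCache, at the cost of the GPUs having to cooperate as pipeline stages.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    """Hypothetical GPU model: total memory and the layers resident on it."""
    capacity_gb: float
    layers: set[int]


def kv_budget_gb(gpu: Gpu, layer_size_gb: float) -> float:
    """Memory left for KVCache after accounting for resident parameters."""
    return gpu.capacity_gb - len(gpu.layers) * layer_size_gb


def drop_replicated(gpus: list[Gpu], num_layers: int) -> None:
    """Keep each layer on exactly one GPU of the group (contiguous split).

    This split is an assumption made for illustration; after it, the GPUs
    hold complementary parameter shards and must serve requests together,
    for example as pipeline stages.
    """
    per_gpu = -(-num_layers // len(gpus))  # ceiling division
    for i, gpu in enumerate(gpus):
        keep = set(range(i * per_gpu, min((i + 1) * per_gpu, num_layers)))
        gpu.layers &= keep


# Hypothetical setup: two GPUs, each holding a full replica of the model.
NUM_LAYERS = 32
LAYER_SIZE_GB = 2.0
gpus = [Gpu(80.0, set(range(NUM_LAYERS))) for _ in range(2)]

before = [kv_budget_gb(g, LAYER_SIZE_GB) for g in gpus]
drop_replicated(gpus, NUM_LAYERS)
after = [kv_budget_gb(g, LAYER_SIZE_GB) for g in gpus]

print("KVCache budget per GPU before drop (GB):", before)
print("KVCache budget per GPU after drop (GB):", after)
```

In this toy setting each GPU frees half of its parameter memory, and every layer is still resident somewhere in the group. The sketch ignores activation memory, the network cost of cooperative execution, and the restoration of parameters afterwards, all of which the abstract names as challenges.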

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.18169 [cs.DC]
	(or arXiv:2412.18169v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2412.18169

Submission history

From: Rongxin Cheng
[v1] Tue, 24 Dec 2024 05:07:46 UTC (19,758 KB)
[v2] Thu, 26 Dec 2024 03:28:03 UTC (19,749 KB)
