Learned Best-Effort LLM Serving

Jha, Siddharth; Hooper, Coleman; Liu, Xiaoxuan; Kim, Sehoon; Keutzer, Kurt

arXiv:2401.07886 (cs)

[Submitted on 15 Jan 2024 (v1), last revised 15 Jul 2024 (this version, v2)]

Abstract: Many applications must provide low-latency LLM service to users or risk an unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. Compared to static serving on unpredictable workloads, our best-effort system can maintain availability with over 10x higher client request rates, serves above 96% of peak performance 4.1x more often, and serves above 98% of peak performance 2.3x more often. Our learned router is robust to shifts in both the arrival and task distributions. Compared to static serving, learned best-effort serving allows for cost-efficient serving through increased hardware utility. Additionally, we argue that learned best-effort LLM serving is applicable in a wide variety of settings and gives application developers great flexibility to meet their specific needs.
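
To make the idea of adjusting service quality to system load concrete, here is a minimal sketch of a best-effort router. It is not the paper's method: the abstract describes a router learned with deep reinforcement learning, whereas this sketch swaps in a fixed hand-written heuristic. The tier names, quality scores, service times, and the `route` function are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    quality: float       # hypothetical relative quality score in [0, 1]
    service_time: float  # hypothetical seconds needed to serve one request


# Hypothetical tiers, ordered from highest to lowest quality.
TIERS = [
    ModelTier("large", 1.00, 0.40),
    ModelTier("medium", 0.97, 0.15),
    ModelTier("small", 0.90, 0.05),
]


def route(queue_lengths: dict[str, int], deadline: float) -> ModelTier:
    """Pick the highest-quality tier whose estimated latency meets the deadline."""
    for tier in TIERS:
        # Estimated latency: queued requests plus this one, served one at a time.
        estimated = (queue_lengths.get(tier.name, 0) + 1) * tier.service_time
        if estimated <= deadline:
            return tier
    # Best effort: no tier meets the deadline, so fall back to the fastest tier.
    return min(TIERS, key=lambda t: t.service_time)


if __name__ == "__main__":
    print(route({}, deadline=1.0).name)            # idle system
    print(route({"large": 5}, deadline=1.0).name)  # large tier is backed up
```

Under light load the sketch sends every request to the highest-quality tier and degrades to cheaper tiers only as queues grow. A learned policy of the kind the abstract describes would replace the fixed latency estimate and greedy choice with a decision conditioned on the observed task distribution and load.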

Comments:	Es-FoMo @ ICML 2024
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2401.07886 [cs.LG]
	(or arXiv:2401.07886v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2401.07886

Submission history

From: Siddharth Jha
[v1] Mon, 15 Jan 2024 18:28:17 UTC (542 KB)
[v2] Mon, 15 Jul 2024 03:54:20 UTC (556 KB)
