One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving

Patke, Archit; Reddy, Dhemath; Jha, Saurabh; Qiu, Haoran; Pinto, Christian; Cui, Shengkun; Narayanaswami, Chandra; Kalbarczyk, Zbigniew; Iyer, Ravishankar

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2407.00047 (cs)

[Submitted on 5 Jun 2024]

Abstract: Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to in production settings. However, existing LLM serving systems focus on optimization objectives such as request serving throughput or request execution latency rather than the end-to-end latency SLOs. Achieving end-to-end SLOs for latency-sensitive requests is challenging due to head-of-line (HOL) blocking in the request queue, which results from bursty arrival rates and insufficient resources.
To address the above challenge, we propose QLM, a multi-model queue management framework for LLM serving. QLM uses stochastic programming to orchestrate the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment. Specifically, QLM uses the following LSOs: model swapping, request eviction, GPU-CPU state swapping, load balancing, and warm model start. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems.
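
As a rough illustration of the head-of-line blocking described in the abstract, the toy sketch below serves a burst of requests on a single device, first in arrival order and then in earliest-deadline order, and reports the fraction that meet their latency SLO. It is not QLM's stochastic-programming formulation and uses none of its LSOs; the `Request` fields, the `slo_attainment` helper, and the workload numbers are all hypothetical and chosen only to make the effect visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # All fields are illustrative assumptions, not taken from the paper.
    name: str
    arrival: float  # arrival time, seconds
    service: float  # time needed on the device, seconds
    slo: float      # end-to-end latency budget, seconds


def slo_attainment(requests, order_key):
    """Serve requests one at a time, without preemption, on one device.

    At each step the already-arrived request with the smallest order_key
    is served next. Returns the fraction of requests whose end-to-end
    latency (queueing plus service) stays within their SLO.
    """
    pending = sorted(requests, key=lambda r: r.arrival)  # stable sort
    clock, met = 0.0, 0
    while pending:
        ready = [r for r in pending if r.arrival <= clock]
        if not ready:
            # Device is idle: jump ahead to the next arrival.
            clock = min(r.arrival for r in pending)
            continue
        nxt = min(ready, key=order_key)  # ties go to the earliest in the list
        pending.remove(nxt)
        clock += nxt.service
        if clock - nxt.arrival <= nxt.slo:
            met += 1
    return met / len(requests)


def fifo(r):
    return r.arrival  # first come, first served


def earliest_deadline(r):
    return r.arrival + r.slo  # most urgent deadline first


# A burst that arrives at once: two long, loosely bounded requests
# queued ahead of two short, latency-sensitive ones.
workload = [
    Request("batch-1", 0.0, 10.0, 60.0),
    Request("batch-2", 0.0, 10.0, 60.0),
    Request("chat-1", 0.0, 1.0, 5.0),
    Request("chat-2", 0.0, 1.0, 5.0),
]

fifo_rate = slo_attainment(workload, fifo)
edf_rate = slo_attainment(workload, earliest_deadline)
print(f"FIFO: {fifo_rate:.0%}, deadline-ordered: {edf_rate:.0%}")
```

In this made-up workload the two short requests miss their SLO under arrival order because they wait behind 20 seconds of long-running work, while reordering by deadline lets all four finish in time. Reordering alone does not help once a long request is already running, which is one reason a system might also want operations such as request eviction.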

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2407.00047 [cs.DC]
	(or arXiv:2407.00047v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2407.00047

Submission history

From: Archit Patke
[v1] Wed, 5 Jun 2024 21:17:34 UTC (5,052 KB)
