SLO-Aware Scheduling for Large Language Model Inferences

Huang, Jinqi; Xiong, Yi; Yu, Xuebing; Huang, Wenjie; Li, Entong; Zeng, Li; Chen, Xin

arXiv:2504.14966 (cs)

[Submitted on 21 Apr 2025]

Abstract: Large language models (LLMs) have revolutionized applications such as code completion, chatbots, and online classification. To improve user experience, service level objectives (SLOs) serve as key benchmarks for assessing the capabilities of inference services. In practice, an inference service processes multiple types of tasks, each with its own distinct SLO. To ensure a satisfactory user experience, scheduling should take each request's SLO into account. However, existing designs lack this consideration, leading to insufficient hardware utilization and suboptimal performance.
This paper analyzes scenarios in which tasks with varying SLOs are processed, and introduces a simulated annealing-based scheduler that decides the priority order of requests based on each request's SLO, input length, and possible output length. As the first specialized scheduler for multi-SLO scenarios, this work improves SLO attainment by up to 5x and reduces average latency by 31.6% on the Python-Code-23k-ShareGPT and ShareGPT_Vicuna_unfiltered datasets, compared to the current state-of-the-art framework vLLM and a new framework, LMDeploy.
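
To make the idea of ordering requests by simulated annealing concrete, the following is a minimal sketch and not the authors' scheduler. It assumes a toy model that is not taken from the paper: requests are served one at a time, service time is linear in input length and estimated output length, and the objective is the number of requests that finish within their SLO. The names Request, service_time, slo_attained, anneal_order, and the two per-token cost constants are all hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    req_id: int
    input_len: int        # prompt tokens
    est_output_len: int   # predicted output tokens (an estimate, not ground truth)
    slo_seconds: float    # end-to-end latency target for this request


# Hypothetical per-token costs, chosen only for illustration.
PREFILL_S_PER_TOKEN = 0.0002
DECODE_S_PER_TOKEN = 0.02


def service_time(r: Request) -> float:
    """Toy latency model: linear in input and estimated output length."""
    return r.input_len * PREFILL_S_PER_TOKEN + r.est_output_len * DECODE_S_PER_TOKEN


def slo_attained(order: list[Request]) -> int:
    """Count requests that finish within their SLO when served sequentially."""
    clock = 0.0
    met = 0
    for r in order:
        clock += service_time(r)
        if clock <= r.slo_seconds:
            met += 1
    return met


def anneal_order(requests, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Search over request orderings with simulated annealing (swap moves)."""
    rng = random.Random(seed)
    current = list(requests)
    current_score = slo_attained(current)
    best, best_score = list(current), current_score
    if len(current) < 2:
        return best
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        new_score = slo_attained(current)
        delta = new_score - current_score
        # Always accept improvements; accept worse orderings with
        # probability exp(delta / temp), which shrinks as temp falls.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current_score = new_score
            if new_score > best_score:
                best, best_score = list(current), new_score
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return best


if __name__ == "__main__":
    reqs = [
        Request(0, 2000, 400, 30.0),  # long generation, loose SLO
        Request(1, 100, 20, 1.0),     # short request, tight SLO
        Request(2, 500, 50, 3.0),
    ]
    best = anneal_order(reqs)
    print("arrival order meets", slo_attained(reqs), "SLOs")
    print("annealed order", [r.req_id for r in best], "meets", slo_attained(best), "SLOs")
```

Under this toy model, serving the three example requests in arrival order meets only one SLO, because the long request delays the two with tight targets, whereas the order 1, 2, 0 meets all three. A real serving system batches requests and has uncertain output lengths, so the cost function here stands in for whatever latency predictor a scheduler would actually use.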

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2504.14966 [cs.DC]
	(or arXiv:2504.14966v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2504.14966

Submission history

From: Jinqi Huang
[v1] Mon, 21 Apr 2025 08:48:48 UTC (552 KB)
