Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale

Jaiswal, Shashwat; Jain, Kunal; Simmhan, Yogesh; Parayil, Anjaly; Mallick, Ankur; Wang, Rujia; St. Amant, Renee; Bansal, Chetan; Rühle, Victor; Kulkarni, Anoop; Kofsky, Steve; Rajmohan, Saravan

arXiv:2502.14617 (cs)

[Submitted on 20 Feb 2025]

Abstract: Large Language Model (LLM) inference workloads handled by global cloud providers can include both latency-sensitive and latency-insensitive tasks, creating a diverse range of Service Level Agreement (SLA) requirements. Managing these mixed workloads is challenging due to the complexity of the inference stack, which includes multiple LLMs, hardware configurations, and geographic distributions. Current optimization strategies often silo these tasks to ensure that SLAs are met for latency-sensitive tasks, but this leads to significant under-utilization of expensive GPU resources despite the availability of spot and on-demand Virtual Machine (VM) provisioning. We propose SAGESERVE, a comprehensive LLM serving framework that employs adaptive control knobs at varying time scales, ensuring SLA compliance while maximizing the utilization of valuable GPU resources. Short-term optimizations include efficient request routing to data center regions, while long-term strategies involve scaling GPU VMs out/in and redeploying models to existing VMs to align with traffic patterns. These strategies are formulated as an optimization problem for resource allocation and solved using Integer Linear Programming (ILP). We perform empirical and simulation studies based on production workload traces with over 8M requests using four open-source models deployed across three regions. SAGESERVE achieves up to 25% savings in GPU-hours while maintaining tail latency and satisfying all SLOs, and it reduces the scaling overhead compared to baselines by up to 80%, confirming the effectiveness of our proposal. In terms of dollar cost, this can save cloud providers up to $2M over the course of a month.
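
To make the idea of posing resource allocation as an integer program more concrete, here is a minimal sketch in Python. It is not the SAGESERVE formulation, which the abstract does not spell out. The function name `solve_toy_allocation`, the model and region names, and all numbers are hypothetical. The toy problem picks an integer number of model instances per region so that each model's total capacity covers its demand and each region stays within its VM limit, at minimum cost. It enumerates every candidate plan instead of calling an ILP solver, which is only practical at this tiny scale.

```python
from itertools import product


def solve_toy_allocation(demand, throughput, region_cost, region_limit, max_per_cell=4):
    """Exhaustively search integer instance counts n[model, region].

    Minimise   sum over (m, r) of region_cost[r] * n[m, r]
    subject to throughput[m] * sum over r of n[m, r] >= demand[m]  for each model m
               sum over m of n[m, r] <= region_limit[r]            for each region r

    All inputs are illustrative assumptions, not values from the paper.
    Returns (best_cost, best_plan), or (None, None) if no plan is feasible.
    """
    models = sorted(demand)
    regions = sorted(region_cost)
    cells = [(m, r) for m in models for r in regions]

    best_cost, best_plan = None, None
    for counts in product(range(max_per_cell + 1), repeat=len(cells)):
        plan = dict(zip(cells, counts))

        # Demand constraint: requests may be routed to any region, so only
        # the total capacity per model has to cover that model's demand.
        if any(throughput[m] * sum(plan[m, r] for r in regions) < demand[m]
               for m in models):
            continue

        # Capacity constraint: a region cannot host more instances than it has VMs.
        if any(sum(plan[m, r] for m in models) > region_limit[r] for r in regions):
            continue

        cost = sum(region_cost[r] * plan[m, r] for m, r in cells)
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, plan

    return best_cost, best_plan


if __name__ == "__main__":
    cost, plan = solve_toy_allocation(
        demand={"model_a": 250, "model_b": 90},       # requests per second (made up)
        throughput={"model_a": 100, "model_b": 50},   # requests per second per instance
        region_cost={"east": 3, "west": 2},           # cost units per instance
        region_limit={"east": 3, "west": 3},          # VMs available per region
    )
    print(cost, plan)
```

With these made-up inputs, five instances are needed in total (three for `model_a`, two for `model_b`). The cheaper region can host only three of them, so the minimum cost is 12. A production-scale version would hand the same objective and constraints to a dedicated ILP solver instead of enumerating plans.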

Comments:	15 pages, 17 figures, 2 tables
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2502.14617 [cs.DC]
	(or arXiv:2502.14617v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2502.14617

Submission history

From: Kunal Jain
[v1] Thu, 20 Feb 2025 14:57:08 UTC (3,429 KB)
