Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference

Zhang, Yongkang; Yu, Haoxuan; Han, Chenxia; Wang, Cheng; Lu, Baotong; Li, Yang; Chu, Xiaowen; Li, Huaicheng

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2407.13996 (cs)

[Submitted on 19 Jul 2024 (v1), last revised 27 Jul 2024 (this version, v2)]

Abstract: Colocating high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services reduces the total cost of ownership (TCO) of GPU clusters. Limited by bottlenecks such as VRAM channel conflicts and PCIe bus contention, existing GPU sharing solutions cannot avoid resource conflicts among concurrently executing tasks, and so fail to achieve both low latency for LS tasks and high throughput for BE tasks. To bridge this gap, this paper presents Missile, a general GPU sharing solution for multi-tenant DNN inference on NVIDIA GPUs. Missile approximates fine-grained GPU hardware resource isolation between multiple LS and BE DNN tasks at the software level. Through comprehensive reverse engineering, Missile first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs and eliminates VRAM channel conflicts using software-level cache coloring. It also isolates the PCIe bus and fairly allocates PCIe bandwidth using a completely fair scheduler. We evaluate 12 mainstream DNNs with synthetic and real-world workloads on four GPUs. The results show that, compared to state-of-the-art GPU sharing solutions, Missile reduces tail latency for LS services by up to ~50%, achieves up to 6.1x BE job throughput, and allocates PCIe bus bandwidth to tenants on demand for optimal performance.
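The abstract states only that PCIe bandwidth is allocated with a completely fair scheduler and gives no implementation details. As a rough illustration of the general idea, and not Missile's actual design, the following minimal sketch dispatches fixed-size transfer chunks to whichever tenant has the smallest weighted virtual runtime. The names `Tenant`, `schedule`, `weight`, and `chunk_bytes` are hypothetical and chosen for this example only.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Tenant:
    # Ordering uses (vruntime, name); name only breaks ties deterministically.
    vruntime: float
    name: str
    weight: float = field(compare=False, default=1.0)  # hypothetical share
    pending: int = field(compare=False, default=0)     # bytes awaiting transfer
    sent: int = field(compare=False, default=0)        # bytes already dispatched


def schedule(tenants, chunk_bytes=1 << 20):
    """Return the order in which chunks would be dispatched.

    Each step picks the tenant with the smallest virtual runtime, sends
    one chunk, and advances that tenant's virtual runtime by bytes/weight,
    so a tenant with twice the weight gets roughly twice the bandwidth
    while both tenants have data pending.
    """
    heap = [t for t in tenants if t.pending > 0]
    heapq.heapify(heap)
    order = []
    while heap:
        t = heapq.heappop(heap)
        n = min(chunk_bytes, t.pending)
        t.pending -= n
        t.sent += n
        t.vruntime += n / t.weight
        order.append(t.name)
        if t.pending > 0:
            heapq.heappush(heap, t)
    return order


if __name__ == "__main__":
    tenants = [
        Tenant(0.0, "LS", weight=2.0, pending=4 << 20),
        Tenant(0.0, "BE", weight=1.0, pending=4 << 20),
    ]
    print(schedule(tenants))
```

Splitting transfers into chunks is what lets a scheduler of this kind interleave tenants at all: a single large copy would otherwise hold the bus until it finished. How Missile chooses chunk sizes or weights is not stated in the abstract.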

Comments:	18 pages, 18 figures
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Performance (cs.PF)
ACM classes:	D.4.9; I.2.5
Cite as:	arXiv:2407.13996 [cs.DC]
	(or arXiv:2407.13996v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2407.13996

Submission history

From: Yongkang Zhang
[v1] Fri, 19 Jul 2024 03:01:32 UTC (11,641 KB)
[v2] Sat, 27 Jul 2024 08:52:39 UTC (11,641 KB)
