Dynamic GPU Energy Optimization for Machine Learning Training Workloads

Wang, Farui; Zhang, Weizhe; Lai, Shichao; Hao, Meng; Wang, Zheng

doi:10.1109/TPDS.2021.3137867

arXiv:2201.01684 (cs)

[Submitted on 5 Jan 2022]

Abstract: GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models become increasingly large, they require a longer time to train, leading to higher GPU energy consumption. This paper presents GPOEO, an online GPU energy optimization framework for machine learning training workloads. GPOEO dynamically determines the optimal energy configuration by employing novel techniques for online measurement, multi-objective prediction modeling, and search optimization. To characterize the target workload behavior, GPOEO utilizes GPU performance counters. To reduce the performance counter profiling overhead, it uses an analytical model to detect changes in the training iteration and collects performance counter data only when an iteration shift is detected. GPOEO employs multi-objective models based on gradient boosting and a local search algorithm to find a trade-off between execution time and energy consumption. We evaluate GPOEO by applying it to 71 machine learning workloads from two AI benchmark suites running on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average execution time increase of 5.1%.
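The abstract does not spell out the search procedure, so the following is only a minimal sketch of how a local search could trade execution time against energy. It assumes a hypothetical grid of SM and memory clock gears, a steepest-descent hill climb, and two toy analytical predictors standing in for the gradient-boosting models; every name, gear count, and coefficient here is an illustrative assumption rather than GPOEO's actual design.

```python
from typing import Callable, Tuple

# (SM clock gear, memory clock gear); hypothetical indices, not real frequencies.
Config = Tuple[int, int]

SM_LEVELS, MEM_LEVELS = 8, 4  # assumed gear counts, not taken from the paper
DEFAULT: Config = (SM_LEVELS - 1, MEM_LEVELS - 1)  # assume the default uses the top gears


def predict_time(cfg: Config) -> float:
    """Toy stand-in for a learned time model: time relative to DEFAULT."""
    ds, dm = DEFAULT[0] - cfg[0], DEFAULT[1] - cfg[1]
    return 1.0 + 0.02 * ds + 0.04 * dm


def predict_energy(cfg: Config) -> float:
    """Toy stand-in for a learned energy model: energy relative to DEFAULT."""
    ds, dm = DEFAULT[0] - cfg[0], DEFAULT[1] - cfg[1]
    return 1.0 - 0.06 * ds + 0.008 * ds**2 - 0.01 * dm + 0.02 * dm**2


def local_search(
    start: Config,
    energy: Callable[[Config], float],
    time: Callable[[Config], float],
    max_slowdown: float,
) -> Tuple[Config, float]:
    """Hill-climb over neighbouring gears, minimizing predicted energy while
    keeping predicted relative time within 1 + max_slowdown."""

    def feasible(cfg: Config) -> bool:
        return time(cfg) <= 1.0 + max_slowdown

    best = start
    best_energy = energy(start) if feasible(start) else float("inf")
    while True:
        sm, mem = best
        neighbours = [(sm - 1, mem), (sm + 1, mem), (sm, mem - 1), (sm, mem + 1)]
        candidates = [
            c for c in neighbours
            if 0 <= c[0] < SM_LEVELS and 0 <= c[1] < MEM_LEVELS and feasible(c)
        ]
        if not candidates:
            return best, best_energy
        nxt = min(candidates, key=energy)
        if energy(nxt) >= best_energy:
            return best, best_energy  # no feasible neighbour improves: local optimum
        best, best_energy = nxt, energy(nxt)


best_cfg, best_energy = local_search(DEFAULT, predict_energy, predict_time, max_slowdown=0.05)
print(best_cfg, f"{best_energy:.3f}", f"{predict_time(best_cfg):.2f}")
```

With these toy coefficients the search settles two SM gears below the default and leaves the memory gear untouched, because any further step would exceed the 5% slowdown budget. A real system would replace the two predictor functions with models trained on performance counter data and would apply the chosen gears through the GPU driver.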

Comments:	Accepted for publication in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS)
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Cite as:	arXiv:2201.01684 [cs.DC]
	(or arXiv:2201.01684v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2201.01684
Related DOI:	https://doi.org/10.1109/TPDS.2021.3137867

Submission history

From: Zheng Wang [view email]
[v1] Wed, 5 Jan 2022 16:25:48 UTC (932 KB)
