Taming the Memory Beast: Strategies for Reliable ML Training on Kubernetes

Ray, Jaideep

arXiv:2412.14701 (cs)

[Submitted on 19 Dec 2024]

Authors: Jaideep Ray

Abstract: Kubernetes offers a powerful orchestration platform for machine learning training, but memory management can be challenging due to specialized needs and resource constraints. This paper outlines how Kubernetes handles memory requests, limits, Quality of Service classes, and eviction policies for ML workloads, with special focus on GPU memory and ephemeral storage. Common pitfalls such as overcommitment, memory leaks, and ephemeral volume exhaustion are examined. We then provide best practices for stable, scalable memory utilization to help ML practitioners prevent out-of-memory events and ensure high-performance ML training pipelines.
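
As a rough illustration of how requests and limits map to Quality of Service classes, the sketch below classifies a pod from its containers' CPU and memory settings. It is not taken from the paper: the function name `qos_class`, the dictionary layout, and the example values (8 CPUs, 64Gi) are all hypothetical. It follows the documented Kubernetes rule that a pod is Guaranteed when every container has equal requests and limits for both CPU and memory, BestEffort when nothing is set, and Burstable otherwise. For simplicity it assumes quantities are compared as identical strings, so "64Gi" and "65536Mi" would be treated as different.

```python
from typing import Dict, List

# Hypothetical layout: {"requests": {"cpu": ..., "memory": ...}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod from its containers' cpu/memory requests and limits."""
    any_set = False          # does any container set any request or limit?
    all_guaranteed = True    # does every container have request == limit?
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit when a limit is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            # Assumption: quantities are compared as plain strings.
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


# Hypothetical training pod with requests equal to limits.
trainer = [{"requests": {"cpu": "8", "memory": "64Gi"},
            "limits": {"cpu": "8", "memory": "64Gi"}}]
print(qos_class(trainer))  # Guaranteed
```

Under node memory pressure, BestEffort pods are typically the first candidates for eviction and Guaranteed pods the last, which is why setting memory requests equal to limits is a common choice for long-running training jobs.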

Comments:	4 pages
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2412.14701 [cs.DC]
	(or arXiv:2412.14701v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2412.14701

Submission history

From: Jaideep Ray
[v1] Thu, 19 Dec 2024 10:10:57 UTC (171 KB)
