MLTCP: Congestion Control for DNN Training

Rajasekaran, Sudarsanan; Narang, Sanjoli; Zabreyko, Anton A.; Ghobadi, Manya

arXiv:2402.09589 (cs)

[Submitted on 14 Feb 2024]

Abstract: We present MLTCP, a technique to augment today's congestion control algorithms to accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication phases of jobs that compete for network bandwidth to interleave with each other, thereby utilizing the network efficiently. At the heart of MLTCP lies a very simple principle based on a key conceptual insight: DNN training flows should scale their congestion window size based on the number of bytes sent at each training iteration. We show that integrating this principle into today's congestion control protocols is straightforward: by adding 30-60 lines of code to Reno, CUBIC, or DCQCN, MLTCP stabilizes flows of different jobs into an interleaved state within a few training iterations, regardless of the number of competing flows or the start time of each flow. Our experiments with popular DNN training jobs demonstrate that enabling MLTCP accelerates the average and 99th percentile training iteration time by up to 2x and 4x, respectively.
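
The principle stated in the abstract, scaling the congestion window based on the bytes sent in the current training iteration, can be illustrated with a toy Reno-style flow. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the scaling function, so the linear form `slope * ratio + intercept`, the constants, the class name `RenoLikeFlow`, and the assumption that the per-iteration byte count is known in advance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RenoLikeFlow:
    """Toy Reno-style flow whose additive increase depends on iteration progress."""

    bytes_per_iteration: int   # assumed known or estimated for each training iteration
    cwnd: float = 10.0         # congestion window, in packets
    bytes_sent: int = 0        # bytes acknowledged so far in the current iteration
    slope: float = 1.0         # hypothetical tuning constant
    intercept: float = 0.5     # hypothetical tuning constant

    def scale(self) -> float:
        # Fraction of this iteration's bytes already sent, clamped to [0, 1].
        ratio = min(self.bytes_sent / self.bytes_per_iteration, 1.0)
        # Assumed linear scaling: flows further along in their
        # communication phase increase their window more aggressively.
        return self.slope * ratio + self.intercept

    def on_ack(self, acked_bytes: int) -> None:
        # Standard Reno congestion avoidance adds 1/cwnd per ACK;
        # here that increment is multiplied by the progress-based factor.
        self.bytes_sent += acked_bytes
        self.cwnd += self.scale() / self.cwnd

    def on_loss(self) -> None:
        # Unmodified Reno multiplicative decrease.
        self.cwnd = max(self.cwnd / 2.0, 1.0)

    def start_iteration(self) -> None:
        # Reset progress when a new training iteration begins.
        self.bytes_sent = 0


if __name__ == "__main__":
    early = RenoLikeFlow(bytes_per_iteration=1000)
    late = RenoLikeFlow(bytes_per_iteration=1000, bytes_sent=800)
    early.on_ack(100)
    late.on_ack(100)
    print(f"early flow cwnd: {early.cwnd:.3f}")
    print(f"late flow cwnd:  {late.cwnd:.3f}")
```

In this sketch, a flow that is further into its iteration grows its window faster than one that has just started. That asymmetry is one plausible way for competing jobs to drift apart in time rather than share bandwidth evenly, which is the interleaving effect the abstract describes.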

Subjects:	Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2402.09589 [cs.NI]
	(or arXiv:2402.09589v1 [cs.NI] for this version)
	https://doi.org/10.48550/arXiv.2402.09589

Submission history

From: Sudarsanan Rajasekaran
[v1] Wed, 14 Feb 2024 21:33:18 UTC (2,971 KB)
