Highly Available Data Parallel ML training on Mesh Networks

Kumar, Sameer; Jouppi, Norm

arXiv:2011.03605 (cs)

[Submitted on 6 Nov 2020]

Abstract: Data parallel ML models can take several days or weeks to train on several accelerators. Such long training runs require the cluster's resources to remain available for the entire duration of the job. On a mesh network this is challenging because failures create holes in the mesh, and packets must be routed around the failed chips to preserve full connectivity. In this paper, we present techniques to route gradient summation allreduce traffic around failed chips on 2-D meshes. We evaluate the performance of our fault-tolerant allreduce techniques via the MLPerf-v0.7 ResNet-50 and BERT benchmarks. Performance results show minimal impact on training throughput on 512 and 1024 TPU-v3 chips.
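
To make the idea of routing around a hole in the mesh concrete, here is a minimal sketch of finding a detour on a 2-D mesh with a failed chip. It is a generic breadth-first search and is not the paper's allreduce scheme, which the abstract does not describe in detail. The function name `route_around_failures`, the `(row, col)` coordinate convention, and the 4x4 mesh size are all assumptions made for illustration.

```python
from collections import deque


def route_around_failures(rows, cols, failed, src, dst):
    """Return a shortest hop-by-hop path from src to dst on a rows x cols
    2-D mesh (no wraparound links), skipping every chip listed in `failed`.

    Chips are hypothetical (row, col) tuples. Returns None when src or dst
    has failed or when the failures disconnect them.
    """
    if src in failed or dst in failed:
        return None
    parent = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent links back to src to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        r, c = node
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            in_mesh = 0 <= nr < rows and 0 <= nc < cols
            if in_mesh and nxt not in failed and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# A failed chip at (1, 1) sits directly between the two endpoints.
healthy = route_around_failures(4, 4, set(), (1, 0), (1, 2))
detour = route_around_failures(4, 4, {(1, 1)}, (1, 0), (1, 2))
```

In this toy case the direct route takes two hops, while the detour around the failed chip takes four. A real fault-tolerant allreduce would also have to rebalance the reduction traffic across the remaining links, which this sketch does not attempt.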

Subjects:	Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2011.03605 [cs.LG]
	(or arXiv:2011.03605v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2011.03605

Submission history

From: Sameer Kumar
[v1] Fri, 6 Nov 2020 21:36:16 UTC (367 KB)
