Speeding Up Distributed Machine Learning Using Codes

Lee, Kangwook; Lam, Maximilian; Pedarsani, Ramtin; Papailiopoulos, Dimitris; Ramchandran, Kannan

doi:10.1109/TIT.2017.2736066

arXiv:1512.02673 (cs)

[Submitted on 8 Dec 2015 (v1), last revised 29 Jan 2018 (this version, v3)]

Abstract: Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems there are several types of noise that can affect the performance of distributed machine learning algorithms -- straggler nodes, system failures, or communication bottlenecks -- but there has been little interaction cutting across codes, machine learning, and distributed systems. In this work, we provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers, and show that if the number of homogeneous workers is $n$, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of $\log n$. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction $\alpha$ of the data matrix can be cached at each worker, and $n$ is the number of workers, coded shuffling reduces the communication cost by a factor of $(\alpha + \frac{1}{n})\gamma(n)$ compared to uncoded shuffling, where $\gamma(n)$ is the ratio of the cost of unicasting $n$ messages to $n$ users to the cost of multicasting a common message (of the same size) to $n$ users. For instance, $\gamma(n) \simeq n$ if multicasting a message to $n$ users is as cheap as unicasting a message to one user. We also provide experimental results that corroborate the theoretical gains of the coded algorithms.
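To make the straggler idea in the abstract concrete, the following is a minimal sketch of coded matrix-vector multiplication. It assumes a toy (3, 2) scheme: the matrix is split into two row blocks and a third worker receives their sum, so the product can be recovered from any two of the three workers. The abstract does not specify the code construction; this scheme, the helper names (`matvec`, `encode`, `decode`), and the sample data are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]

def encode(A):
    """Split A into two row blocks and add a parity block A1 + A2.

    Assumes A has an even number of rows (illustrative simplification).
    """
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return [A1, A2, parity]

def decode(results):
    """Recover A @ x from any two of the three worker results.

    `results` maps a worker index (0, 1, or 2) to that worker's output.
    """
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:  # have A1 x and (A1 + A2) x
        y1 = results[0]
        y2 = [p - a for p, a in zip(results[2], y1)]
    else:  # have A2 x and (A1 + A2) x
        y2 = results[1]
        y1 = [p - b for p, b in zip(results[2], y2)]
    return y1 + y2

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]
worker_outputs = [matvec(block, x) for block in encode(A)]
expected = matvec(A, x)

# Any single straggler can be ignored: every pair of workers suffices.
for finished in combinations(range(3), 2):
    partial = {i: worker_outputs[i] for i in finished}
    assert decode(partial) == expected
```

The cost of this tolerance is redundancy: three workers each process half of the matrix, instead of two. The $\log n$ speedup stated in the abstract concerns how this trade-off behaves as the number of workers grows, which this toy example does not attempt to reproduce.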

Comments:	This work is published in IEEE Transactions on Information Theory and presented in part at the NIPS 2015 Workshop on Machine Learning Systems and the IEEE ISIT 2016
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG); Performance (cs.PF)
Cite as:	arXiv:1512.02673 [cs.DC]
	(or arXiv:1512.02673v3 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.1512.02673
Related DOI:	https://doi.org/10.1109/TIT.2017.2736066

Submission history

From: Kangwook Lee [view email]
[v1] Tue, 8 Dec 2015 21:54:04 UTC (2,376 KB)
[v2] Thu, 10 Dec 2015 19:34:37 UTC (2,376 KB)
[v3] Mon, 29 Jan 2018 03:04:14 UTC (833 KB)
