Exploring Erasure Coding Techniques for High Availability of Intermediate Data

Zhang, Zhe; Bockelman, Brian; Weitzel, Derek; Swanson, David

arXiv:2004.05729 (cs)

[Submitted on 13 Apr 2020]

Abstract: Scientific computing workflows generate enormous amounts of distributed data that is short-lived yet critical for job completion time. This class of data is called intermediate data. A common way to achieve high data availability is replication. However, the increasing scale of intermediate data generated by modern scientific applications demands new storage techniques that improve storage efficiency. Erasure codes, as an alternative, use less storage space while maintaining similar data availability. In this paper, we adopt erasure codes for storing intermediate data and compare their performance with that of replication. We also use the Mean-Time-To-Data-Loss (MTTDL) metric to estimate the lifetime of intermediate data. We propose an algorithm that proactively relocates data redundancy from vulnerable machines to reliable ones, improving data availability at the cost of some extra network overhead. Furthermore, we propose an algorithm that places the redundancy units of a data item physically close to each other on the network, reducing the network bandwidth needed to reconstruct the data when it is accessed.
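
The storage trade-off between replication and erasure coding mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's model: it assumes a generic (k, m) maximum-distance-separable code (k data fragments, m parity fragments, any k of the k + m fragments suffice to rebuild the data) and independent node failures with a fixed probability p. The function names and parameters are hypothetical and chosen only for illustration.

```python
from math import comb


def storage_overhead_replication(replicas: int) -> float:
    """Raw bytes stored per byte of user data with full replication."""
    return float(replicas)


def storage_overhead_erasure(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data with a (k, m) code."""
    return (k + m) / k


def availability_replication(p: float, replicas: int) -> float:
    """Probability that at least one replica survives.

    Assumes each node fails independently with probability p.
    """
    return 1.0 - p ** replicas


def availability_erasure(p: float, k: int, m: int) -> float:
    """Probability that at least k of the k + m fragments survive.

    Assumes an MDS code and independent fragment loss with probability p,
    so the data stays readable as long as at most m fragments are lost.
    """
    n = k + m
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(m + 1))


if __name__ == "__main__":
    p = 0.01  # assumed per-node failure probability, for illustration only
    print("3x replication overhead:", storage_overhead_replication(3))
    print("(6, 3) erasure overhead:", storage_overhead_erasure(6, 3))
    print("3x replication availability:", availability_replication(p, 3))
    print("(6, 3) erasure availability:", availability_erasure(p, 6, 3))
```

Under these assumptions, a (6, 3) code stores 1.5 bytes per byte of user data while tolerating three lost fragments, compared with 3 bytes per byte for triple replication, which tolerates two lost copies. Real deployments would also need to account for correlated failures and repair time, which this sketch ignores.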

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:2004.05729 [cs.DC]
(or arXiv:2004.05729v1 [cs.DC] for this version)
https://doi.org/10.48550/arXiv.2004.05729

Submission history

From: Zhe Zhang
[v1] Mon, 13 Apr 2020 00:13:01 UTC (2,384 KB)
