Self-healing Dilemmas in Distributed Systems: Fault-correction vs. Fault-tolerance

Nikolic, Jovan; Jubatyrov, Nursultan; Pournaras, Evangelos


arXiv:2007.05261v2 (cs)

[Submitted on 10 Jul 2020 (v1), revised 12 Apr 2021 (this version, v2), latest version 24 Jun 2021 (v3)]

Abstract: Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault-detection inherits network uncertainties, making a faulty process indistinguishable from a slow process. The implications can be dramatic: self-healing mechanisms become biased and cost-ineffective. In particular, triggering an undesirable fault-correction results in new faults that could be prevented with fault-tolerance instead. Nevertheless, fault-tolerance alone, without eventually correcting persistent faults, makes systems underperform as well. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing and the Internet of Things in several application domains of energy, transport and health. This paper introduces a novel and general-purpose modeling of fault scenarios. These scenarios can accurately measure and predict inconsistencies generated by fault-correction and fault-tolerance when each node in a network can monitor the health status of another node, while both can defect. In contrast to related work, no information about the computational/application scenario, overlying algorithms or application data is required. A rigorous experimental methodology is designed that evaluates 696 experimental settings of different fault scales, fault profiles and fault detection thresholds, each with almost 9M measurements of inconsistencies in a prototyped decentralized network of 3000 nodes. The prediction performance of the modeled fault scenarios is validated in a challenging application scenario of decentralized and dynamic in-network aggregation using real-world data from a Smart Grid pilot project. Findings confirm the origin of inconsistencies at design phase and provide new insights into how to tune self-healing at design phase.
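The dilemma described in the abstract can be illustrated with a minimal sketch of threshold-based fault detection. This is not the paper's model or prototype: the names `Observation`, `suspects` and `count_errors` are hypothetical, and the silence values are toy numbers chosen only to show the trade-off. The assumption is that a monitor suspects a node once it has been silent for longer than a detection threshold, so a low threshold triggers fault-correction on slow but healthy nodes, while a high threshold tolerates nodes that have actually failed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    silence: float       # seconds since the monitored node was last heard from
    truly_faulty: bool   # ground truth; a real monitor cannot observe this


def suspects(obs: Observation, threshold: float) -> bool:
    """The monitor suspects a node whose silence exceeds the threshold."""
    return obs.silence > threshold


def count_errors(observations, threshold):
    """Return (false_corrections, tolerated_faults) for a given threshold.

    false_corrections: healthy (merely slow) nodes that would be corrected.
    tolerated_faults:  faulty nodes that would be tolerated instead of corrected.
    """
    false_corrections = sum(
        1 for o in observations if suspects(o, threshold) and not o.truly_faulty
    )
    tolerated_faults = sum(
        1 for o in observations if not suspects(o, threshold) and o.truly_faulty
    )
    return false_corrections, tolerated_faults


# Toy observations (illustrative values, not measurements from the paper).
OBSERVATIONS = [
    Observation(0.2, False),
    Observation(0.9, False),
    Observation(2.5, False),  # slow but healthy
    Observation(1.5, True),   # failed recently, silence still short
    Observation(4.0, True),
]

if __name__ == "__main__":
    for threshold in (1.0, 3.0):
        print(threshold, count_errors(OBSERVATIONS, threshold))
```

With these toy values, the threshold of 1.0 corrects one healthy node and misses no faults, while the threshold of 3.0 corrects no healthy node but tolerates one faulty node. Neither setting removes both kinds of inconsistency, which is the trade-off the fault detection thresholds in the abstract refer to.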

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
Cite as:	arXiv:2007.05261 [cs.DC]
	(or arXiv:2007.05261v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2007.05261

Submission history

From: Evangelos Pournaras [view email]
[v1] Fri, 10 Jul 2020 09:10:00 UTC (623 KB)
[v2] Mon, 12 Apr 2021 17:50:13 UTC (7,380 KB)
[v3] Thu, 24 Jun 2021 16:34:40 UTC (7,379 KB)
