Nursultan Jubatyrov scite author profile

Nursultan Jubatyrov

2 Publications

3 Citation Statements Received

184 Citation Statements Given

Affiliations

Meta (United Kingdom)

Publications

Self-Healing Dilemmas in Distributed Systems: Fault Correction vs. Fault Tolerance

Nikolić, Jubatyrov, Pournaras

2021

IEEE Trans. Netw. Serv. Manage.

Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties, making a remote faulty process indistinguishable from a slow process. In the case of a slow process without a fault, fault correction is undesirable, as it can trigger new faults that could be prevented with fault tolerance, which is a more proactive form of system maintenance. But in the case of an actually faulty process, fault tolerance alone, without eventually correcting persistent faults, can make systems underperform. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing and the Internet of Things in several energy, transport and health applications. This paper contributes a novel and general-purpose modeling of fault scenarios during system runtime. They are used to accurately measure and predict inconsistencies generated by the undesirable outcomes of fault correction and fault tolerance, as the means to improve self-healing of large-scale decentralized systems at the design phase. A rigorous experimental methodology is designed that evaluates 696 experimental settings of different fault scales, fault profiles and fault detection thresholds in a prototyped decentralized network of 3000 nodes. Almost 9 million measurements of inconsistencies were collected in a network where each node monitors the health status of another node, while both can defect. The prediction performance of the modeled fault scenarios is validated in a challenging application scenario of decentralized and dynamic in-network data aggregation using real-world data from a Smart Grid pilot project. Findings confirm the origin of inconsistencies at the design phase and provide new insights into how to tune self-healing at an early stage. Strikingly, the aggregation accuracy is well predicted, as shown by high correlations and low root mean square errors.
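The dilemma described in the abstract can be illustrated with a toy timeout-based detector. This is only a minimal sketch and not the paper's fault-scenario model: the function name `evaluate_threshold`, the encoding of a failed process as `None`, and the sample delays are all hypothetical. It shows why a single detection threshold trades needless corrections of slow processes against the time a real fault is merely tolerated.

```python
from typing import List, Optional, Tuple


def evaluate_threshold(delays: List[Optional[float]], threshold: float) -> Tuple[int, float]:
    """Score one timeout threshold against hypothetical observations.

    Each entry is the reply delay in seconds of a healthy (possibly slow)
    process, or None for a process that has really failed and never replies.
    """
    # Healthy-but-slow processes that exceed the timeout are corrected needlessly.
    false_corrections = sum(1 for d in delays if d is not None and d > threshold)
    # A real fault stays uncorrected (only tolerated) until the timeout expires.
    faulty = sum(1 for d in delays if d is None)
    seconds_fault_tolerated = faulty * threshold
    return false_corrections, seconds_fault_tolerated


if __name__ == "__main__":
    observed = [0.1, 0.4, 2.5, None, 0.2, None]  # made-up delays, not paper data
    for threshold in (1.0, 3.0):
        print(threshold, evaluate_threshold(observed, threshold))
```

With these made-up delays, the short threshold corrects one slow but healthy process, while the long threshold avoids that at the cost of leaving the two real faults uncorrected for longer.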

Self-healing Dilemmas in Distributed Systems: Fault Correction vs. Fault Tolerance

Nikolić, Jubatyrov, Pournaras

2020

Preprint

Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault-detection inherits network uncertainties, making a faulty process indistinguishable from a slow process. The implications can be dramatic: self-healing mechanisms become biased and cost-ineffective. In particular, triggering an undesirable fault-correction results in new faults that could be prevented with fault-tolerance instead. Nevertheless, fault-tolerance alone, without eventually correcting persistent faults, makes systems underperform as well. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing and the Internet of Things in several application domains of energy, transport and health. This paper introduces a novel and general-purpose modeling of fault scenarios. They can accurately measure and predict inconsistencies generated by fault-correction and fault-tolerance when each node in a network can monitor the health status of another node, while both can defect. In contrast to related work, no information about the computational/application scenario, overlying algorithms or application data is required. A rigorous experimental methodology is designed that evaluates 696 experimental settings of different fault scales, fault profiles and fault detection thresholds, each with almost 9 million measurements of inconsistencies in a prototyped decentralized network of 3000 nodes. The prediction performance of the modeled fault scenarios is validated in a challenging application scenario of decentralized and dynamic in-network aggregation using real-world data from a Smart Grid pilot project. Findings confirm the origin of inconsistencies at the design phase and provide new insights into how to tune self-healing mechanisms at that phase. Strikingly, the aggregation accuracy is well predicted, as shown by high correlations and low root mean square errors when calibration methods with application-independent features are applied.
