Localizing Faults in Cloud Systems

Mariani, Leonardo; Monni, Cristina; Pezzè, Mauro; Riganelli, Oliviero; Xin, Rui

doi:10.1109/icst.2018.00034

Cited by 58 publications

(63 citation statements)

References 39 publications

“…Beschastnikh et al [40] discuss the key features and debugging challenges of distributed systems and present a visualization tool named ShiViz. Leonardo et al [43] introduce a lightweight fault localization approach for cloud systems; it can localize faults with high precision, by relying on only lightweight positive training. In contrast to the preceding previous work, our work is the first to conduct delta debugging for microservice systems.…”

Section: Related Work (mentioning)

confidence: 99%

Delta Debugging Microservice Systems with Parallel Optimization

Zhou, Peng, Xie, et al., 2022

IEEE Trans. Serv. Comput.

Microservice systems are complicated due to their runtime environments and service communications. Debugging a failure involves the deployment and manipulation of microservice systems on a containerized environment and faces unique challenges due to the high complexity and dynamism of microservices. To address these challenges, we propose a debugging approach for microservice systems based on the delta debugging algorithm, which is to minimalize failure-inducing deltas of circumstances (e.g., deployment, environmental configurations). Our approach includes novel techniques for defining, deploying/manipulating, and executing deltas during delta debugging. In particular, to construct a (failing) circumstance space for delta debugging to minimalize, our approach defines a set of circumstance dimensions that can affect the execution of microservice systems. To automate the testing of deltas, our approach includes the design of an infrastructure layer for automating deployment and manipulation of microservice systems. To optimize the delta debugging process, our approach includes the design of parallel execution for delta testing tasks. Our evaluation shows that our approach is scalable and efficient with the provided infrastructure resources and the designed parallel execution for optimization. Our experimental study on a medium-size microservice benchmark system shows that our approach can effectively identify failure-inducing deltas that help diagnose the root causes.

“…Once the operation level degrades below a pre-set threshold, various restoration procedures must be carried out until the desired level of operation is achieved. Large-scale cloud systems might require additional steps in order to pinpoint the exact location of a fault [107].…”

Section: A Formal Resilience Orchestration (mentioning)

confidence: 99%

Architectural Resilience in Cloud, Fog and Edge Systems: A Survey

Prokhorenko, Babar, 2020

IEEE Access

An increasing number of large-scale distributed systems are being built by incorporating Cloud, Fog, and Edge computing. There is an important need of understanding how to ensure the resilience of systems built using Cloud, Fog, and Edge computing. This survey reports the state-of-the-art of architectural approaches that have been reported for ensuring the resilience of Cloud-, Fog-and Edge-based systems. This work reports a flexible taxonomy for reviewing architectural resilience approaches for distributed systems. In addition, this work also presents a capability-based cyber-foraging framework intended to improve the overall system resilience in the context of a physical node's capabilities. This survey also highlights the trust-related issues and solutions in the context of system resilience and reliability. This survey will help improve the understanding of the current state of system resilience solutions and raise awareness about the issues related to physical capabilities and trust management in the context of distributed systems resilience.

“…Besides, Tuncer et al [35] proposed a new framework for detecting anomalies in HPC systems by clustering statistical features that retained application characteristics from the time series. On another hand, Mariani et al [37] proposed a new approach named LOUD that associated machine learning with graph centrality algorithms. LOUD analyzed KPIs metrics collected from the running systems using machine learning lightweight positive training.…”

Section: Related Work (mentioning)

confidence: 99%

Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers

Halawa, Redondo, Vilas, 2020

Sensors

Performance analysis is an essential task in high-performance computing (HPC) systems, and it is applied for different purposes, such as anomaly detection, optimal resource allocation, and budget planning. HPC monitoring tasks generate a huge number of key performance indicators (KPIs) to supervise the status of the jobs running in these systems. KPIs give data about CPU usage, memory usage, network (interface) traffic, or other sensors that monitor the hardware. Analyzing this data, it is possible to obtain insightful information about running jobs, such as their characteristics, performance, and failures. The main contribution in this paper was to identify which metric/s (KPIs) is/are the most appropriate to identify/classify different types of jobs according to their behavior in the HPC system. With this aim, we had applied different clustering techniques (partition and hierarchical clustering algorithms) using a real dataset from the Galician computation center (CESGA). We concluded that (i) those metrics (KPIs) related to the network (interface) traffic monitoring provided the best cohesion and separation to cluster HPC jobs, and (ii) hierarchical clustering algorithms were the most suitable for this task. Our approach was validated using a different real dataset from the same HPC center.
