Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies

Scheinert, Dominik; Acker, Alexander; Thamsen, Lauritz; Geldenhuys, Morgan K.; Kao, Odej

doi:10.1109/cloudintelligence52565.2021.00011

Cited by 10 publications

(3 citation statements)

References 20 publications


“…Seer [23] trained deep learning algorithms on massive amounts of data to identify root causes, but its performance may degrade with system updates. Scheinert proposed Arvalus and its improved algorithm D-Arvalus [24]. In the two algorithms, the system components are regarded as microservices, and the dependencies between components are regarded as connections, to identify the root cause in a graph.…”

Section: Related Work (mentioning)

confidence: 99%

Hi-RCA: A Hierarchy Anomaly Diagnosis Framework Based on Causality and Correlation Analysis

Yang, Guo, Chen et al. 2023

Applied Sciences


Microservice architecture has been widely adopted by large-scale applications. Due to the huge amount of data and complex microservice dependency, it also poses new challenges in ensuring reliable performance and maintenance. Existing approaches still suffer from limitations of anomaly data, over-simplification of metric relationships, and lack of diagnosing interpretability. To solve these issues, this paper builds a hierarchy root cause diagnosis framework, named Hi-RCA. We propose a global perspective to characterize different abnormal symptoms, which focuses on changes in metrics’ causation and correlation. We decompose the diagnosis task into two phases: anomalous microservice location and anomalous reason diagnosis. In the first phase, we use Kalman filtering to quantify microservice abnormality based on the estimation error. In the second phase, we use causation analysis to identify anomalous metrics and generate anomaly knowledge graphs; by correlation analysis, we construct an anomaly propagation graph and explain the anomaly symptoms via graph comparison. Our experimental evaluation on an open dataset shows that Hi-RCA can effectively locate root causes with 90% mean average precision, outperforming state-of-the-art methods.
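The first phase described in this abstract, scoring abnormality from a Kalman filter's estimation error, can be illustrated with a minimal sketch. This is not the Hi-RCA implementation: it assumes a single scalar metric, a random-walk state model, and hand-picked noise variances (`process_var`, `measurement_var`), all of which are hypothetical choices made for illustration.

```python
def kalman_anomaly_scores(series, process_var=1e-3, measurement_var=1e-1):
    """Score each sample by its normalized Kalman innovation.

    Assumes a scalar random-walk model: the state is expected to stay
    where it was, so a large prediction error marks an unusual sample.
    """
    estimate = series[0]   # initial state estimate
    variance = 1.0         # initial uncertainty of the estimate
    scores = [0.0]         # the first sample has no prediction to compare with
    for measurement in series[1:]:
        # Predict: the state is unchanged, its uncertainty grows.
        predicted_variance = variance + process_var
        # Innovation: gap between the measurement and the prediction.
        innovation = measurement - estimate
        innovation_variance = predicted_variance + measurement_var
        scores.append(abs(innovation) / innovation_variance ** 0.5)
        # Update: blend prediction and measurement using the Kalman gain.
        gain = predicted_variance / innovation_variance
        estimate = estimate + gain * innovation
        variance = (1.0 - gain) * predicted_variance
    return scores


# A flat metric with a single spike at index 20.
latency = [1.0] * 20 + [10.0] + [1.0] * 5
scores = kalman_anomaly_scores(latency)
spike_index = scores.index(max(scores))
```

In this toy series the highest score falls on the spike. A real deployment would need per-metric noise estimates and a threshold, neither of which is specified in the abstract.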


“…When collected over time, metric data can provide an abstract representation of the state of each system component. As in our previous work [27], we define metric data as multivariate time series, i.e. a temporally ordered sequence of vectors S = (S_t ∈ ℝ^d : t = 1, 2, .…”

Section: A Preliminaries (mentioning)

confidence: 99%

Probabilistic Time Series Forecasting for Adaptive Monitoring in Edge Computing Environments

Scheinert, Aghdam, Becker et al. 2022

2022 IEEE International Conference on Big Data (Big Data)

Self Cite


With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated due to the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a filtering approach for data-emitting devices or conduct dynamic sampling based on employed prediction models. Still, existing methods mainly require adaptive monitoring on edge devices, which demands device reconfigurations, utilizes additional resources, and limits the sophistication of employed models. In this paper, we propose a sampling-based and cloud-located approach that internally utilizes probabilistic forecasts and hence provides means of quantifying model uncertainties, which can be used for contextualized adaptations of sampling frequencies and consequently relieves constrained network resources. We evaluate our prototype implementation for the monitoring pipeline on a publicly available streaming dataset and demonstrate its positive impact on resource efficiency in a method comparison.


“…Anomalous traces are detected if their STVs do not follow the distribution. Scheinert et al. [50] present a neural graph method to detect and localize anomalies. It models the components in the distributed cloud application as nodes and their dependencies as edges.…”

Section: Related Work (mentioning)

confidence: 99%

TraceRank: Abnormal service localization with dis‐aggregated end‐to‐end tracing data in cloud native systems

Huang, Chen 2021

J Software Evolu Process


Modern cloud native applications are generally built with a microservice architecture. To tackle various performance problems among a large number of services and machines, an end‐to‐end tracing tool is always equipped in these systems to track the execution path of every single request. However, it is nontrivial to conduct root cause analysis of anomalies with such a large volume of tracing data. This paper proposes a novel system named TraceRank to identify and locate abnormal services causing performance problems with dis‐aggregated end‐to‐end traces. TraceRank mainly includes an anomaly detection module and a root cause analysis module. The root cause analysis procedure is triggered when an anomaly is detected. To fully leverage the information provided by the tracing data, both the spectrum analysis and the PageRank‐based random walk methods are introduced to pinpoint abnormal services. The experiments in TrainTicket and Bookinfo microservice benchmarks and a real‐world system show that TraceRank can locate root causes with 90% in Precision and 86% in Recall. TraceRank has up to 10% improvement compared with several state‐of‐the‐art approaches in both Precision and Recall. Finally, TraceRank has good scalability and a low overhead to adapt to large‐scale microservice systems.
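The PageRank-based random walk mentioned in this abstract can be illustrated with a generic personalized PageRank over a service call graph. This sketch does not reproduce TraceRank: it leaves out the spectrum analysis entirely, and the graph, the service names, and the per-service anomaly scores used as the preference vector are all hypothetical.

```python
def personalized_pagerank(edges, preference, damping=0.85, iterations=100):
    """Rank nodes by a random walk that restarts according to `preference`.

    `edges` is a list of (caller, callee) pairs; `preference` maps node
    names to non-negative weights (here: assumed anomaly scores).
    """
    nodes = sorted({node for edge in edges for node in edge})
    successors = {node: [] for node in nodes}
    for caller, callee in edges:
        successors[caller].append(callee)

    total = sum(preference.get(node, 0.0) for node in nodes)
    if total <= 0:
        raise ValueError("preference must give positive weight to some node")
    restart = {node: preference.get(node, 0.0) / total for node in nodes}

    rank = dict(restart)
    for _ in range(iterations):
        # Restart mass, distributed according to the preference vector.
        updated = {node: (1.0 - damping) * restart[node] for node in nodes}
        for node in nodes:
            walk_mass = damping * rank[node]
            targets = successors[node]
            if targets:
                # Follow outgoing calls with equal probability.
                for target in targets:
                    updated[target] += walk_mass / len(targets)
            else:
                # Node without outgoing calls: jump back via the preference vector.
                for other in nodes:
                    updated[other] += walk_mass * restart[other]
        rank = updated
    return rank


# Hypothetical call graph: frontend calls orders and catalog, both call db.
calls = [("frontend", "orders"), ("frontend", "catalog"),
         ("orders", "db"), ("catalog", "db")]
anomaly_scores = {"frontend": 0.2, "orders": 0.3, "catalog": 0.1, "db": 0.4}
ranking = personalized_pagerank(calls, anomaly_scores)
top_suspect = max(ranking, key=ranking.get)
```

Walking along call edges pushes probability mass toward downstream services, so a shared dependency that also looks anomalous accumulates rank. Whether edges should point from caller to callee or the reverse is a design choice that depends on how anomalies are assumed to propagate.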
