Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition

Li, Mingjie; Li, Zeyan; Yin, Kanglin; Nie, Xin; Zhang, Wenchi; Sui, Kaixin; Pei, Dan

doi:10.1145/3534678.3539041

Cited by 29 publications

(12 citation statements)

References 17 publications


“…For the scoring step, we categorize the methods into random-walk-based and regression-based methods. We selected PageRank [40] as a random-walk-based method, following implementations in [15], [16], [24], [25] and HT [13], which is the only regression-based approach. As an exception, RCD does not have a separate scoring phase because it treats the failure as an intervention in the causal structure graph on the root fault metrics.…”

Section: Fault Localization Methods (mentioning)

confidence: 99%

“…(ii) The anomaly-propagation methods localize root fault metrics by tracing the propagation of fault-induced anomalies in monitoring metrics [10]-[13], [15], [17], [20]-[22], [24]-[26], [35]. With these methods, fault localization is attributed to a source localization problem of signal propagation in complex networks, which is a well-studied problem in the field of network science [36].…”

Section: B Automated Fault Localization (mentioning)

confidence: 99%


MetricSifter: Feature Reduction of Multivariate Time Series Data for Efficient Fault Localization in Cloud Applications

Tsubouchi, Tsuruta. 2024. IEEE Access.


Automated fault localization in large-scale cloud-based applications is challenging because it involves mining multivariate time series data from large volumes of operational monitoring metrics. To improve localization accuracy, automated fault localization methods incorporate feature reduction to reduce the number of monitoring metrics unrelated to a failure. However, these methods have problems with inaccuracy, either from removing too many failure-related metrics or from retaining too few failure-unrelated metrics. In this paper, we present MetricSifter, a feature reduction framework designed to accurately identify anomalous metrics caused by faults. Our framework locates a failure time window with the highest density of fault-induced change point times across monitoring metrics with a focus on their temporal proximity. Experimental results indicate that MetricSifter achieves an accuracy of 0.981, which is significantly better than the selected baseline methods. Furthermore, experiments combining various reduction methods with various localization methods demonstrate that MetricSifter improves the recall and time efficiency over the baseline methods.


“…Furthermore, incorporating knowledge of the system architecture can improve the accuracy of the estimated causal graph by removing unnecessary or redundant connections between metrics and enforcing connections that are inherent in microservice systems. Some works [15,20] have developed a causal Bayesian network of the system using system knowledge and causal assumptions. However, to the best of our knowledge, no previous studies have combined instance-level variations in metric data with system knowledge to estimate a causal graph at the performance metric level, which is the main contribution of our research.…”

Section: Related Work (mentioning)

confidence: 99%

“…Throughout this paper, we consider these broad categories of metrics in our formulation, while individual monitoring metrics can be plugged into the categories. Similar to a previous approach [20], we define certain causal assumptions between the metric categories based on domain knowledge of system engineers to define a causal metric graph (Fig. 2).…”

Section: Metrics Data and Causal Assumptions (mentioning)

confidence: 99%

“…However, none of the off-the-shelf causal discovery algorithms [22,37,39] can handle the task of instance-level causal structure learning. Past works [11,15,20,27] build a causal graph only at an aggregate level and overlook the deployment strategy of microservices over multiple instances. However, even a few instances failing can degrade the quality of service, which might not be captured in an aggregate statistics.…”

Section: Introduction (mentioning)

confidence: 99%


CausIL: Causal Graph for Instance Level Microservice Data

Chakraborty, Garg, Agarwal et al. 2023. Proceedings of the ACM Web Conference 2023.


AI-based monitoring has become crucial for cloud-based services due to its scale. A common approach to AI-based monitoring is to detect causal relationships among service components and build a causal graph. Availability of domain information makes cloud systems even better suited for such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load-balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques as they neither incorporate the system architectural domain information nor provide a way to model distributed compute across varying numbers of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from system architecture. Towards the application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. Simulation study shows the efficacy of CausIL over baselines by improving graph estimation accuracy by ∼25% as measured by Structural Hamming Distance whereas the real-world dataset demonstrates CausIL's applicability in deployment settings.


Performance issue monitoring, identification and diagnosis of SaaS software: a survey

Wang, Tian, Ying. 2024. Front. Comput. Sci.

