2020
DOI: 10.1145/3392149

Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment

Abstract: Root cause analysis in a large-scale production environment is challenging due to the complexity of the services running across global data centers. Due to the distributed nature of a large-scale system, the various hardware, software, and tooling logs are often maintained separately, making it difficult to review the logs jointly for understanding production issues. Another challenge in reviewing the logs for identifying issues is the scale - there could easily be millions of entities, each described by hundr…

Citations: cited by 19 publications (6 citation statements)
References: 12 publications
“…N, n the number of traces, the number of services m, c the type number of metrics, the collected number of each type metric Metric Anomaly Score: We use the mean µ_ik and standard deviation σ_ik of the service metrics to calculate service anomaly severity [20][21][22][23][24]. The µ_ik is the expected normal value and the σ_ik indicates that the metric deviates from the mean.…”
Section: Notation Definitions (mentioning)
confidence: 99%
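
The excerpt above says only that a mean and a standard deviation of each service metric feed the anomaly severity; it does not give the formula. As a rough illustration, the sketch below assumes a plain z-score (absolute deviation from the mean, in units of standard deviation). The function name `anomaly_score`, its parameters, and the zero-variance handling are hypothetical and are not taken from the citing paper.

```python
from statistics import mean, stdev


def anomaly_score(history, observed):
    """Hypothetical severity score: |observed - mu| / sigma.

    history  -- past values of one metric for one service (needs >= 2 points)
    observed -- the new value to score
    This z-score form is an assumption; the quoted excerpt does not state
    the exact formula.
    """
    mu = mean(history)      # expected normal value of the metric
    sigma = stdev(history)  # sample standard deviation around the mean
    if sigma == 0:
        # A constant history has no spread: any change is maximally anomalous.
        return 0.0 if observed == mu else float("inf")
    return abs(observed - mu) / sigma


if __name__ == "__main__":
    latency_ms = [10, 12, 11, 9, 13]
    print(anomaly_score(latency_ms, 11))  # value at the mean
    print(anomaly_score(latency_ms, 15))  # value well above the mean
```

A larger score means the observed value sits further from the metric's usual range; any threshold for flagging a service would have to be chosen separately.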
“…One of the most important capabilities of the AI-powered data-analytics for IT operations (AIOps) is fully automated RCA ([Gartner Research, 2019]). Many AIOps platform vendors like IBM (see [IBM, 2021]), Facebook (see [Lin et al, 2020]), VMware (see , Marvasti et al, 2014a, Marvasti et al, 2014b, Marvasti et al, 2016, Harutyunyan et al, 2020b, Harutyunyan et al, 2020c), HPE (see [HPE, 2019]), BigPanda (see [BigPanda, 2020]), DataDog (see [Othmane A.-A., 2021]), Moogsoft (see [Sahil K., 2016]) and others ([Moogsoft, 2016]) have almost complete vision and solution for the domain-centric RCA described in Figure 1.…”
Section: Related Work (mentioning)
confidence: 99%
“…It has been in the focus of researchers for decades with diverse ideas including anomaly detection, event correlations, causal inference, correlation analysis, predictive models and many others (see [ABS Consulting et al, 2014, Zawawy et al, 2010, Chuah et al, 2010, Cai et al, 2019, Marvasti et al, 2014b] with references therein). Ideally, RCA should analyze all acquired monitoring datasets including logs (see , Harutyunyan et al, 2018a, Mi et al, 2012, Kostroš et al, 2014, Tak et al, 2016, Chuah et al, 2010, Bird et al, 2015, Zawawy et al, 2010, Michalski, 1983), traces (see [Suriadi et al, 2013, Lin et al, 2020]) and time series data (see [Jeyakumar et al, 2019, Pearl, 2009, Spirtes et al, 2000]) with possible correlations among them. Distributed tracing is the classical approach to application monitoring and diagnostics (see [Opentracing, 2019]).…”
Section: Related Work (mentioning)
confidence: 99%