With the development of cloud computing, the microservice architecture (MSA) has become a prevailing architecture for cloud-native applications. User-facing services are typically backed by many microservices, and the dependencies between those services are more complicated than in a traditional monolithic application. In this setting, an anomalous change in the performance metric of one microservice can degrade related services or even cause them to fail, which may lead to large losses for dependent businesses. In the operation and maintenance of cloud applications, it is therefore critical to mine the causality behind a problem and find its root cause as quickly as possible. In this paper, we propose an approach for mining causality and diagnosing root causes that uses knowledge graph technology and a causal search algorithm. We verified the proposed method on a classic cloud-native application and found it to be effective: for most of the application's services, both precision and recall were over 80%.
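To make the idea of tracing an anomaly back through service dependencies concrete, the following is a minimal sketch under stated assumptions. It is not the knowledge-graph construction or causal search algorithm proposed in the paper. It assumes that a directed causal graph and a set of anomalous nodes are already available, and the function name `root_cause_candidates` and all service names are hypothetical.

```python
from collections import deque


def root_cause_candidates(causes, anomalous, alarmed):
    """Walk upstream from the alarmed node through anomalous nodes only.

    causes    -- dict mapping a node to the nodes that directly influence it
                 (assumed to be given; the paper mines this structure itself)
    anomalous -- set of nodes whose metrics currently look abnormal
    alarmed   -- node on which the problem was first observed

    Returns the anomalous nodes reached that have no anomalous cause of
    their own. These are only candidates, not a confirmed diagnosis.
    """
    seen = {alarmed}
    queue = deque([alarmed])
    candidates = []
    while queue:
        node = queue.popleft()
        upstream = [p for p in causes.get(node, []) if p in anomalous]
        if not upstream:
            # Nothing anomalous influences this node, so the trail ends here.
            candidates.append(node)
        for parent in upstream:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return candidates


# Hypothetical dependency graph: each key is influenced by the listed nodes.
causes = {
    "frontend": ["orders", "catalogue"],
    "orders": ["orders-db", "payment"],
    "catalogue": ["catalogue-db"],
}
anomalous = {"frontend", "orders", "orders-db"}
result = root_cause_candidates(causes, anomalous, "frontend")
```

In this toy graph the walk goes from `frontend` to `orders` to `orders-db` and stops there, because `orders-db` has no anomalous cause of its own. A real system would also need to rank candidates when several trails end at different nodes.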
Accurate anomaly detection and timely intervention are critical for cloud application maintenance. Traditional methods for performance anomaly detection based on thresholds and rules work well for simple key performance indicator (KPI) monitoring. Unfortunately, it is difficult to find appropriate threshold levels when KPI values differ significantly at different times of day or fluctuate significantly because of different usage patterns. Anomaly detection is therefore a challenge for all types of temporal data, particularly when non-stationary time series have special adaptability requirements or when the nature of potential anomalies is vaguely defined or unknown. To address this limitation, we propose a novel anomaly detector for time-series KPIs, called KPI-TSAD, based on supervised deep-learning models with convolution and long short-term memory (LSTM) neural networks, together with a variational auto-encoder (VAE) oversampling model that addresses the imbalanced classification problem. Compared with other related research on Yahoo's anomaly detection benchmark datasets, KPI-TSAD exhibited better performance, with both its accuracy and F-score exceeding 0.90 on the A1Benchmark and A2Benchmark datasets. Finally, KPI-TSAD continued to perform well on several KPI monitoring datasets from real production environments, with the average F-score exceeding 0.72.
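The difficulty with fixed thresholds on KPIs that vary by time of day can be illustrated with a small sketch. This is not KPI-TSAD and involves no neural network; it only contrasts a single global threshold with a simple per-hour baseline on synthetic data. The data generator, the function names, and all parameter values are assumptions made for illustration.

```python
import math
import statistics


def synthetic_kpi(days=7, spike_at=(3, 18), spike_size=30.0):
    """Hourly KPI with a daily sine cycle and one injected spike at (day, hour)."""
    values = []
    for day in range(days):
        for hour in range(24):
            value = 100.0 + 50.0 * math.sin(2 * math.pi * hour / 24)
            if (day, hour) == spike_at:
                value += spike_size  # anomaly placed in the daily trough
            values.append(value)
    return values


def fixed_threshold(values, limit):
    """Flag every index whose value exceeds one global limit."""
    return [i for i, v in enumerate(values) if v > limit]


def hourly_baseline(values, k=3.0, min_dev=1.0):
    """Flag points that deviate from the median of the same hour of day."""
    flagged = []
    for hour in range(24):
        same_hour = values[hour::24]
        med = statistics.median(same_hour)
        mad = statistics.median([abs(v - med) for v in same_hour])
        tolerance = max(k * mad, min_dev)  # min_dev guards against a zero MAD
        for day, v in enumerate(same_hour):
            if abs(v - med) > tolerance:
                flagged.append(day * 24 + hour)
    return sorted(flagged)


values = synthetic_kpi()
spike_index = 3 * 24 + 18
global_alarms = fixed_threshold(values, 140.0)
seasonal_alarms = hourly_baseline(values)
```

On this synthetic series the global limit of 140 fires on the normal daily peak and misses the injected spike, because the spike sits in the trough and stays well below the limit. The per-hour baseline flags only the spike. Real KPIs are noisier and less regular than this, which is the gap learned detectors are meant to close.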