Understanding and Detecting Software Upgrade Failures in Distributed Systems

Zhang, Yongle; Yang, Junwen; Jin, Zhuqi; Sethi, Utsav; Rodrigues, Kirk; Lu, Shan; Yuan, Di

doi:10.1145/3477132.3483577

Cited by 23 publications

(1 citation statement)

References 33 publications


“…Prior work in this space has focused on two main directions. First, there has been several empirical studies on analyzing incidents and outages in production systems which have focused on studying incidents caused by certain type of issues [48]-[51] or issues from specific services and systems [52]-[54]. Second and more related to our work is the use of machine learning and data driven techniques for automating different aspects of incident lifecycle such as triaging [55], [56], diagnosis [57]-[59] and mitigation [5].…”

Section: A Incident Management

mentioning

confidence: 99%

Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models

Ahmed, Ghosh, Bansal et al. 2023

Preprint


Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence has resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents. We do a rigorous study at Microsoft, on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned and multi-task setting using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners show the efficacy and future potential of using artificial intelligence for resolving cloud incidents.
