Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners lack the access, expertise, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, cloud environments are inherently unpredictable and variable with respect to their performance. In this study, we explore the effects of cloud environments on the variability of performance-testing outcomes, and to what extent regressions can still be reliably detected. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three different cloud services (AWS, GCE, and Azure) across different instance types. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (from 0.03% to over 100% relative standard deviation). We also observe that hypothesis testing with the Wilcoxon rank-sum test generally leads to unsatisfactory results for detecting regressions, due to a very high number of false positives in all tested configurations. However, simply testing for a difference in medians detects even small regressions with good success. In some cases, a shift in median execution time as small as 1% can be detected with a low false-positive rate, given a large sample of 20 instances.
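
To make the two detection strategies mentioned above concrete, the sketch below (not the paper's actual analysis pipeline) compares execution-time samples from a "control" and a "treatment" version of a benchmark: it quantifies variability as relative standard deviation, applies the Wilcoxon rank-sum test, and checks for a shift in medians. The sample arrays, the synthetic data generation, and the 1% threshold are all hypothetical illustrations, not values taken from the study.

```python
# Minimal sketch (assumptions labeled): quantify variability and flag a
# regression between two hypothetical benchmark samples.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(42)
control = rng.normal(loc=100.0, scale=5.0, size=200)    # ns/op, hypothetical
treatment = rng.normal(loc=101.0, scale=5.0, size=200)  # ~1% slower, hypothetical

def rsd(samples):
    """Relative standard deviation (coefficient of variation) in percent."""
    return 100.0 * np.std(samples, ddof=1) / np.mean(samples)

print(f"RSD control:   {rsd(control):.2f}%")
print(f"RSD treatment: {rsd(treatment):.2f}%")

# (a) Wilcoxon rank-sum test: flag a regression if p < alpha (0.05 here).
stat, p_value = ranksums(control, treatment)
print(f"rank-sum p-value: {p_value:.4f} -> "
      f"{'regression flagged' if p_value < 0.05 else 'no regression flagged'}")

# (b) Median comparison: flag a regression if the median execution time
# increases by more than a chosen threshold (1%, an assumed value).
median_shift = (np.median(treatment) - np.median(control)) / np.median(control)
print(f"median shift: {median_shift * 100:.2f}% -> "
      f"{'regression flagged' if median_shift > 0.01 else 'no regression flagged'}")
```

In the study itself, such samples would come from repeated benchmark executions on many cloud instances rather than from synthetic distributions.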