Evaluating Features for Machine Learning Detection of Order- and Non-Order-Dependent Flaky Tests

Parry, Owain; Kapfhammer, Gregory M.; Hilton, Michael; McMinn, Phil

doi:10.1109/icst53961.2022.00021

Cited by 11 publications

(4 citation statements)

References 40 publications


“…We retrieved tests from the empirical study of flaky tests across programming languages of Costa et al [51] and from a recent study about pinpointing causes of flakiness by Habchi et al [52]. We also retrieved the flaky tests from iFixFlakies [17] as Test order dependency is a flakiness category that received a large interest in the community [9], [18], [54], [55].…”

Section: Discussion (mentioning)

confidence: 99%

“…Others investigated the use of test smells [13] and code metrics [33] for predicting flaky tests. Trying to outperform the performances of existing approaches, others relied on a mix of static and dynamic features, like FlakeFlagger [34] or Flake16 [35]. Fixing flakiness is also an aspect that has recently been investigated.…”

Section: Related Work (mentioning)

confidence: 99%

“…In the case of tree-based models, the reported information gain is given by the Gini importance (also known as Mean Decrease in Impurity) [48]. Parry et al [35] used SHapley Additive explanations (SHAP), which is another popular technique for model interpretability [49].…”

Section: Interpretability Technique (mentioning)

confidence: 99%


FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning

Amal, Haben, Habchi, et al. 2023

2023 IEEE/ACM International Conference on Automation of Software Test (AST)


Flaky tests are tests that yield different outcomes when run on the same version of a program. This nondeterministic behaviour plagues continuous integration with false signals, wasting developers' time and reducing their trust in test suites. Studies highlighted the importance of keeping tests flakiness-free. Recently, the research community has been pushing towards the detection of flaky tests by suggesting many static and dynamic approaches. While promising, those approaches mainly focus on classifying tests as flaky or not and, even when high performances are reported, it remains challenging to understand the cause of flakiness. This part is crucial for researchers and developers that aim to fix it. To help with the comprehension of a given flaky test, we propose FlakyCat, the first approach to classify flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages Siamese networks to train a multi-class classifier. We train and evaluate FlakyCat on a set of 451 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with an F1 score of 73%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more challenging to predict. Finally, to facilitate the comprehension of FlakyCat's predictions, we present a new technique for CodeBERT-based model interpretability that highlights code statements influencing the categorization.


“…Many existing flakiness detection approaches rely on information extracted at runtime: FlakeFlagger [15] and Flake16 [16] measure properties such as API usage, file-system access, memory usage, and threading behavior to extract features for training binary classifiers to distinguish flaky from nonflaky tests. Others go a step further and mutate the execution environment to expose flakiness by setting seeds of random number generators [19], switching implementations of methods with non-deterministic specifications [17], or adding noise to the execution environment [20].…”

Section: A Using Instrumentation or Language-specific Artifacts To De... (mentioning)

confidence: 99%

Practical Flaky Test Prediction using Common Code Evolution and Test History Data

Gruber, Heine, Oster, et al. 2023

Preprint


Non-deterministically behaving test cases cause developers to lose trust in their regression test suites and to eventually ignore failures. Detecting flaky tests is therefore a crucial task in maintaining code quality, as it builds the necessary foundation for any form of systematic response to flakiness, such as test quarantining or automated debugging. Previous research has proposed various methods to detect flakiness, but when trying to deploy these in an industrial context, their reliance on instrumentation, test reruns, or language-specific artifacts was inhibitive. In this paper, we therefore investigate the prediction of flaky tests without such requirements on the underlying programming language, CI, build or test execution framework. Instead, we rely only on the most commonly available artifacts, namely the tests' outcomes and durations, as well as basic information about the code evolution to build predictive models capable of detecting flakiness. Furthermore, our approach does not require additional reruns, since it gathers this data from existing test executions. We trained several established classifiers on the suggested features and evaluated their performance on a large-scale industrial software system, from which we collected a data set of 100 flaky and 100 non-flaky test and code histories. The best model was able to achieve an F1-score of 95.5% using only 3 features: the tests' flip rates, the number of changes to source files in the last 54 days, as well as the number of changed files in the most recent pull request.
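The abstract names the flip rate as a feature but does not define it. As a rough illustration only, the sketch below assumes one common reading: the fraction of consecutive executions in a test's history whose pass/fail outcome differs. The function name `flip_rate` and the sample history are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def flip_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of consecutive run pairs whose outcome differs.

    Assumed definition for illustration; True means the run passed.
    """
    if len(outcomes) < 2:
        # With fewer than two runs there is no pair to compare.
        return 0.0
    flips = sum(1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur)
    return flips / (len(outcomes) - 1)


# Hypothetical history gathered from existing CI executions (no extra reruns).
history = [True, True, False, True, True, False, False, True]
print(f"flip rate: {flip_rate(history):.3f}")
```

Under this reading, a test that always passes or always fails has a flip rate of 0, and one that alternates on every run has a flip rate of 1.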


Empirically evaluating flaky test detection techniques combining test case rerunning and machine learning models

Parry, Kapfhammer, Hilton, et al. 2023

Empir Software Eng


A flaky test is a test case whose outcome changes without modification to the code of the test case or the program under test. These tests disrupt continuous integration, cause a loss of developer productivity, and limit the efficiency of testing. Many flaky test detection techniques are rerunning-based, meaning they require repeated test case executions at a considerable time cost, or are machine learning-based, and thus they are fast but offer only an approximate solution with variable detection performance. These two extremes leave developers with a stark choice. This paper introduces CANNIER, an approach for reducing the time cost of rerunning-based detection techniques by combining them with machine learning models. The empirical evaluation involving 89,668 test cases from 30 Python projects demonstrates that CANNIER can reduce the time cost of existing rerunning-based techniques by an order of magnitude while maintaining a detection performance that is significantly better than machine learning models alone. Furthermore, the comprehensive study extends existing work on machine learning-based detection and reveals a number of additional findings, including (1) the performance of machine learning models for detecting polluter test cases; (2) using the mean values of dynamic test case features from repeated measurements can slightly improve the detection performance of machine learning models; and (3) correlations between various test case features and the probability of the test case being flaky.
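To make the rerunning-based idea concrete, here is a minimal sketch of the baseline the abstract describes: execute a test repeatedly and flag it as flaky if its outcomes disagree. This is not CANNIER itself. The helper names (`rerun_outcomes`, `is_flaky`, `make_intermittent_test`) and the toy tests are hypothetical.

```python
import itertools
from typing import Callable, List


def rerun_outcomes(test: Callable[[], None], reruns: int) -> List[bool]:
    """Run `test` `reruns` times; record True for a pass, False for a failure."""
    outcomes = []
    for _ in range(reruns):
        try:
            test()
            outcomes.append(True)
        except AssertionError:
            outcomes.append(False)
    return outcomes


def is_flaky(outcomes: List[bool]) -> bool:
    """A test is flagged as flaky if it both passed and failed across reruns."""
    return len(set(outcomes)) > 1


def make_intermittent_test() -> Callable[[], None]:
    """Build a toy test that fails on every third call (stand-in for real flakiness)."""
    calls = itertools.count(1)

    def test() -> None:
        assert next(calls) % 3 != 0

    return test


def stable_test() -> None:
    assert 1 + 1 == 2


print(is_flaky(rerun_outcomes(make_intermittent_test(), 10)))
print(is_flaky(rerun_outcomes(stable_test, 10)))
```

The cost of this baseline grows linearly with the number of reruns per test, which is the time cost the abstract says a machine learning model can help reduce.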
