2019 IEEE 1st International Workshop on Intelligent Bug Fixing (IBF)
DOI: 10.1109/ibf.2019.8665475
A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark

Abstract: Automatic program repair papers tend to repeatedly use the same benchmarks. This poses a threat to the external validity of the findings of the program repair research community. In this paper, we perform an automatic repair experiment on a benchmark called QuixBugs that has never been studied in the context of program repair. In this study, we report on the characteristics of QuixBugs, and study five repair systems, Arja, Astor, Nopol, NPEfix and RSRepair, which are representatives of generate-and-validate re…

Cited by 35 publications (25 citation statements) · References 36 publications
“…For instance, in the paper in which jGenProg [27] is presented, there is an evaluation on Defects4J: this evaluation has no citation in the second column of the table because the evaluation is in jGenProg's paper. Later, it was evaluated again on Defects4J [26] and also on QuixBugs [48], which contain citations of the empirical evaluation papers in the table. The table also presents additional information on the evaluations, which are the number of bugs given as input to the repair tools, and the number of bugs for which the tools generated a test-suite adequate patch (i.e.…”
Section: State of Affairs on Test-Suite-Based Automatic Repair Tools (mentioning confidence: 99%)
“…They also found that a small number of bugs (9/47) could be repaired with a test-suite adequate patch that is also correct. Ye et al [48] presented a study where nine repair tools were executed on the bugs from QuixBugs. They used automatically generated test cases based on the human-written patches to identify incorrect patches generated by the repair tools.…”
Section: Related Work (mentioning confidence: 99%)
“…The other aspect is test adequacy of the buggy class. We use line coverage and branch coverage to measure it, as existing studies do [49,50]. Our intuition is that the test quality measured by line and branch coverage is related to the type of correct patches generated for this bug, since existing studies have shown that the correctness (i.e., plausible, overfitting, or correct) of APR-generated patches has a strong correlation with the test quality [32,34,36].…”
Section: Research Questions (mentioning confidence: 99%)
“…The data of patch complexity is from the previous study [48], in which the characteristics of each bug in Defects4J have been analyzed. The data of test adequacy is calculated by Cobertura, a free Java tool widely used in recent studies [49,50]. If different types of patches are generated for the same bug, the data of this bug is added into all the relevant types for analysis.…”
Section: Bug Characteristics (mentioning confidence: 99%)
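The citation statements above repeatedly invoke the distinction between a *test-suite adequate* patch (one that passes every test in the provided suite) and a *correct* patch (one that is semantically right, typically judged against held-out tests or the human-written fix). A minimal sketch of that distinction, with hypothetical functions and test data not taken from the paper:

```python
# Hedged illustration (not the paper's actual harness) of the
# "test-suite adequate" vs. "correct" patch distinction used in
# APR evaluations on QuixBugs and Defects4J.

def run_suite(candidate, suite):
    # A patch is test-suite adequate iff it passes every test in the suite.
    return all(candidate(*args) == expected for args, expected in suite)

# Reference (correct) implementation of gcd, standing in for the
# human-written patch.
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

# A hypothetical overfitting "patch": it hard-codes the expected
# outputs of the weak suite instead of computing gcd.
def overfitting_gcd(a, b):
    return {(12, 8): 4, (9, 3): 3}.get((a, b), 1)

weak_suite = [((12, 8), 4), ((9, 3), 3)]   # the suite the repair tool saw
held_out   = [((10, 4), 2)]                # an automatically generated extra test

adequate = run_suite(overfitting_gcd, weak_suite)  # True: passes the given suite
correct  = run_suite(overfitting_gcd, held_out)    # False: fails the held-out test
```

This is why, as in the Ye et al. study quoted above, additional tests generated from the human-written patch can expose patches that are test-suite adequate yet incorrect.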