Summary
Coincidental correctness (CC) arises when a defective program produces the correct output even though its defect was exercised. Researchers have recognized the negative impact of CC, and the authors have previously conducted a study demonstrating its prevalence in test suites. However, that study was limited to system tests and to small subjects seeded with artificial defects. In this paper, we conduct a wider-scope study of CC that addresses the following research questions in the context of the Defects4J benchmark. RQ1: Is CC prevalent in Defects4J? RQ2: Is CC affected by the testing levels in Defects4J? RQ3: Do CC tests induce peculiar infection paths in Defects4J? Furthermore, we use JTidy and NanoXML to address the following question. RQ4: Are the infections likely to be nullified within or outside the buggy method? To answer RQ1, we manually injected two code checkers for each of the 395 Defects4J defects: (i) a weak checker that detects weak CC tests by monitoring whether the defect was reached; and (ii) a strong checker that detects strong CC tests by monitoring whether the defect was reached and the program transitioned into an infectious state. Our results showed that CC is prevalent in Defects4J, as we observed 38.1× more strong CC tests than failing tests and 60.5× more weak CC tests than failing tests. Testing has traditionally been classified into several levels, including unit, module, integration, system, and acceptance. However, the test cases in Defects4J are not classified into any of these levels, and the boundaries between the levels are themselves unclear because no universal definition exists. Therefore, to answer RQ2, we derive the testing level of a test case from its method coverage information; specifically, we base it on the number of methods it covers and how frequently they are executed.
Our results showed that CC is present at all testing levels, but is more prevalent at higher testing levels than at lower ones. To answer RQ3, we contrasted the characteristics of the infection propagation paths induced by the Defects4J failing tests with those induced by the strong CC tests. We observed that the paths induced by the CC tests (i) were considerably longer on average and (ii) comprised a higher number of conditional, modulo, multiplication, division, and invocation statements. Finally, to answer RQ4, which relates to RQ2, we performed an experiment involving JTidy, NanoXML, and their associated high-level test suites. We used code checkers to determine whether, in the case of strong CC, the infections were nullified before or after exiting the buggy method. In all of our observations, the infections were nullified after exiting the buggy method. © 2019 John Wiley & Sons, Ltd.
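The distinction between the weak and strong checkers can be illustrated with a minimal Java sketch. It is not the instrumentation used in the study: the class `CcCheckerSketch`, the methods `scale`, `isLarge`, and `classify`, and the seeded defect are all hypothetical, and the sketch assumes that each test is reported under its strongest label (every strong CC test is also a weak CC test).

```java
// Hypothetical illustration of weak and strong CC checkers; not the study's code.
class CcCheckerSketch {
    enum Outcome { NOT_REACHED, WEAK_CC, STRONG_CC, FAILING }

    // Flags set by the injected checkers; reset before every test.
    static boolean reached;
    static boolean infected;

    // Hypothetical buggy method: the intended result is x * 2.
    static int scale(int x) {
        if (x < 0) {
            return 0;          // early exit: the defect is not on this path
        }
        reached = true;        // weak checker: the defective statement executes
        int buggy = x + 2;     // seeded defect (should be x * 2)
        int fixed = x * 2;     // value the corrected statement would produce
        if (buggy != fixed) {
            infected = true;   // strong checker: program state is infected
        }
        return buggy;
    }

    // Code under test: the comparison can mask (nullify) an infected value.
    static boolean isLarge(int x) {
        return scale(x) > 5;
    }

    // Runs one test and reports the strongest label that applies to it.
    static Outcome classify(int input, boolean expected) {
        reached = false;
        infected = false;
        boolean passed = isLarge(input) == expected;
        if (!reached) {
            return Outcome.NOT_REACHED;
        }
        if (!passed) {
            return Outcome.FAILING;
        }
        return infected ? Outcome.STRONG_CC : Outcome.WEAK_CC;
    }

    public static void main(String[] args) {
        System.out.println(classify(-1, false)); // defect never executed
        System.out.println(classify(2, false));  // reached, but 2 + 2 == 2 * 2
        System.out.println(classify(4, true));   // infected (6 vs 8), masked by > 5
        System.out.println(classify(3, true));   // infected (5 vs 6), output differs
    }
}
```

In this sketch the infection for input 4 is nullified in `isLarge`, that is, after control has left the buggy method `scale`, which mirrors the situation examined in RQ4.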