Problems with SZZ and features: An empirical study of the state of practice of defect prediction data collection

Herbold, Steffen; Trautsch, Alexander; Trautsch, Fabian; Ledel, Benjamin

doi:10.1007/s10664-021-10092-4

Cited by 45 publications

(47 citation statements)

References 85 publications


“…Thus, we can not only compare pre-trained models to each other, but also their benefit over simpler models that can be trained in minutes without any special hardware. Second, the task is hard and often not solved correctly by humans, unless careful manual validation is used [43], [44], [45].…”

Section: Fine-tuned Prediction Tasks (mentioning)

confidence: 99%

“…We suggest to use the five projects from Herzig et al [44] and the 38 projects from Herbold et al [45] together and conduct a leave-one-project-out cross validation experiment. This means that one project is used for testing and the remaining 42 projects for the training of the classifier, i.e., of a fastText model as baseline and for the fine-tuning of the BERT models.…”

Section: Fine-tuned Prediction Tasks (mentioning)

confidence: 99%
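The leave-one-project-out protocol described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function name `leave_one_project_out`, the dictionary layout, and the toy project names and issue titles are all hypothetical, and no classifier is trained here.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_project_out(
    projects: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, train_items, test_items) once per project."""
    for held_out in sorted(projects):
        # All items of the held-out project form the test set.
        test_items = list(projects[held_out])
        # Items of every other project form the training set.
        train_items = [
            item
            for name in sorted(projects)
            if name != held_out
            for item in projects[name]
        ]
        yield held_out, train_items, test_items


# Hypothetical toy data: three projects with a few issue titles each.
toy_projects = {
    "project-a": ["NPE in parser", "Add CLI flag"],
    "project-b": ["Crash on empty input"],
    "project-c": ["Update docs", "Wrong rounding in sum"],
}

folds = list(leave_one_project_out(toy_projects))
for name, train, test in folds:
    print(f"test on {name}: {len(train)} training items, {len(test)} test items")
```

With the 43 projects in the quoted setup, the same loop would produce 43 folds, each training on 42 projects and testing on the remaining one.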


On the validity of pre-trained transformers for natural language processing in the software engineering domain

Mosel, Trautsch, Herbold

2021

Preprint

Self Cite


Transformers are the current state-of-the-art of natural language processing in many domains and are gaining traction within software engineering research as well. Such models are pre-trained on large amounts of data, usually from the general domain. However, we only have a limited understanding regarding the validity of transformers within the software engineering domain, i.e., how good such models are at understanding words and sentences within a software engineering context and how this improves the state-of-the-art. Within this article, we shed light on this complex, but crucial issue. We compare BERT transformer models trained with software engineering data with transformers based on general domain data in multiple dimensions: their vocabulary, their ability to understand which words are missing, and their performance in classification tasks. Our results show that for tasks that require understanding of the software engineering context, pre-training with software engineering data is valuable, while general domain models are sufficient for general language understanding, also within the software engineering domain.



“…Our study subjects consist of 23 Java projects under the umbrella of the Apache Software Foundation, previously collected by Herbold et al (2020). Table 1 contains the list of our study subjects.…”

Section: Study Subjects (mentioning)

confidence: 99%

“…The ITS usually has a kind of label or type to distinguish bugs from other issues, e.g., feature requests. However, research shows that this label is often incorrect, e.g., Antoniol et al (2008), Herzig et al (2013b) and Herbold et al (2020). Moreover, with this kind of software evolution research, we are interested in bugs existing in the software and not bugs which occur because of external factors, e.g., new environments or dependency upgrades.…”

Section: Study Subjects (mentioning)

confidence: 99%


Are automated static analysis tools worth it? An investigation into relative warning density and external software quality

Trautsch, Herbold, Grabowski

2021

Preprint

Self Cite


Automated Static Analysis Tools (ASATs) are part of software development best practices. ASATs are able to warn developers about potential problems in the code. On the one hand, ASATs are based on best practices so there should be a noticeable effect on software quality. On the other hand, ASATs suffer from false positive warnings, which developers have to inspect and then ignore or mark as invalid. In this article, we ask whether ASATs have a measurable impact on external software quality, using the example of PMD for Java. We investigate the relationship between ASAT warnings emitted by PMD on defects per change and per file. Our case study includes data for the history of each file as well as the differences between changed files and the project in which they are contained. We investigate whether files that induce a defect have more static analysis warnings than the rest of the project. Moreover, we investigate the impact of two different sets of ASAT rules. We find that bug-inducing files contain fewer static analysis warnings than other files of the project at that point in time. However, this can be explained by the overall decreasing warning density. When compared with all other changes, we find a statistically significant difference in one metric for all rules and two metrics for a subset of rules. However, the effect size is negligible in all cases, showing that the actual difference in warning density between bug-inducing changes and other changes is small at best.


A new perspective on the competent programmer hypothesis through the reproduction of real faults with repeated mutations

Ahmed, Schwass, Herbold et al.

2024

Software Testing Verif & Rel

Self Cite


The competent programmer hypothesis is one of the fundamental assumptions of mutation testing, which claims that most programmers are competent enough to create correct or almost correct source code. This implies that faults should usually manifest through small variations of the correct code. Consequently, researchers assumed that the synthetic faults injected in source code through the mutation operators closely resemble the real faults. Unfortunately, it is still unclear whether the competent programmer hypothesis holds, as past research presents contradictory claims. Within this article, we provide a new perspective on the competent programmer hypothesis and its relation to mutation testing. We try to re‐create real‐world faults through chains of mutations to understand if there is a direct link between mutation testing and faults. The lengths of these paths help us to understand if the source code is really almost correct, or if large variations are required. Our experiments used a state‐of‐the‐art benchmark database of real faults named Defects4J 2.0.0. It contains 835 reproducible real‐world faults in 17 open‐source projects that comprise a total of 1044 bug‐fix pairs of files. Our results indicate that while the competent programmer hypothesis seems to be true, mutation testing is missing important operators to generate representative real‐world faults.
