Proceedings of the Evaluation and Assessment on Software Engineering 2019
DOI: 10.1145/3319008.3319009
Problems with Statistical Practice in Human-Centric Software Engineering Experiments

Abstract: Background: Examples of questionable statistical practice, when published in high-quality software engineering (SE) journals, may lead novice researchers to adopt incorrect statistical practices. Objective: Our goal is to highlight issues contributing to poor statistical practice in human-centric SE experiments. Method: We reviewed the statistical analysis practices used in the 13 papers that reported families of human-centric SE experiments and were published in high-quality journals. Results: Reviewed papers …

Cited by 13 publications (10 citation statements)
References 52 publications (62 reference statements)
“…During data extraction, it became clear that many of our 13 primary studies included experiments with crossover designs. Vegas et al. (2016) warned that the terminology used to describe crossover designs was not used consistently, and we found exactly the same problem with our primary studies (Kitchenham et al. 2019a). Therefore, we used the description of the experimental design provided by the authors to derive our own classification.…”
Section: Experimental Methods Used by the Primary Studies (RQ2)
Citation type: mentioning, confidence: 94%
“…Finally, we have also observed some weaknesses in the experimenters' statistical knowledge. This problem was previously pointed out by other researchers, e.g., [17, 28]. The ESEM community (and the overall SE community as well) should establish measures to improve experimenters' statistical skills.…”
Section: Discussion
Citation type: mentioning, confidence: 85%
“…Given our focus on journals, we extracted data from: Transactions on Software Engineering (TSE), Transactions on Software Engineering and Methodology (TOSEM), Empirical Software Engineering (EMSE), Journal of Systems and Software (JSS), and Information and Software Technology (IST). The same sample of journals was used in previous studies by Kitchenham et al. [52]. The main reason for selecting these five journals is that they are well-known, top-ranked software engineering journals focusing primarily on applied scientific contributions.…”
Section: Screening and Selection of Papers
Citation type: mentioning, confidence: 99%
“…Even though they are a subset of existing peer-reviewed publication venues, they comprise five popular and top-ranked SE journals. Also, the same sample has been used in similar ESE studies [52]. An alternative would have been to include conference papers as well, but we argue that the limited number of pages available to conference papers could hinder researchers from reporting thoroughly on their empirical studies, e.g., by not including enough details on the choice and usage of statistics.…”
Section: Threats to Validity
Citation type: mentioning, confidence: 99%