A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies

Boulesteix, Anne‐Laure; Hable, Robert; Lauer, Sabine; Eugster, Manuel J. A.

doi:10.1080/00031305.2015.1005128

Cited by 30 publications

(35 citation statements)

References 33 publications

“…The superiority of RF tends to be more pronounced for increasing p and . More generally, our study outlines the importance of inclusion criteria and the necessity to include a large number of datasets in benchmark studies as outlined in previous literature [11, 28, 31]. …”

Section: Discussion (supporting)

confidence: 58%

“…Considering the M × 2 matrix, collecting the performance measures for the two investigated methods (LR and RF) on the M considered datasets, one can perform a test for paired samples to compare the performances of the two methods [31]. We refer to the previously published statistical framework [31] for a precise mathematical definition of the tested null-hypothesis in the case of the t-test for paired samples.…”

Section: Methods (mentioning)

confidence: 99%

“…We refer to the previously published statistical framework [31] for a precise mathematical definition of the tested null-hypothesis in the case of the t-test for paired samples. In this framework, the datasets play the role of the i.i.d.…”

Section: Methods (mentioning)

confidence: 99%
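The paired comparison described in the Methods statements above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the function name `paired_t_statistic` and the performance values are hypothetical, and only the t statistic and its degrees of freedom are computed (a p-value would additionally need the t distribution, which is not in the Python standard library).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(perf):
    """Paired-samples t statistic for an M x 2 performance matrix.

    perf: list of (method_a, method_b) performance pairs, one pair per
    dataset. The datasets are treated as the i.i.d. observations.
    Returns (mean difference, t statistic, degrees of freedom).
    """
    diffs = [a - b for a, b in perf]
    m = len(diffs)
    if m < 2:
        raise ValueError("at least two datasets are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    mean_diff = mean(diffs)
    t_stat = mean_diff / (sd / sqrt(m))
    return mean_diff, t_stat, m - 1


# Hypothetical accuracies of two methods on four datasets (illustration only).
perf = [(0.82, 0.80), (0.75, 0.71), (0.90, 0.90), (0.68, 0.62)]
mean_diff, t_stat, df = paired_t_statistic(perf)
print(f"mean difference = {mean_diff:.3f}, t = {t_stat:.3f}, df = {df}")
```

Each row of the matrix contributes one difference, so the number of datasets M, not the number of observations within a dataset, determines the degrees of freedom.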

Random forest versus logistic regression: a large-scale benchmark experiment

2018

Self Cite

Background and goal: The Random Forest (RF) algorithm for regression and classification has considerably gained popularity since its introduction in 2001. Meanwhile, it has grown to a standard classification approach competing with logistic regression in many innovation-friendly scientific fields.

Results: In this context, we present a large scale benchmarking experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired from clinical trial methodology, thus avoiding common pitfalls and major sources of biases.

Conclusion: RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95% CI = [0.022, 0.038]) for the accuracy, 0.041 (95% CI = [0.031, 0.053]) for the Area Under the Curve, and −0.027 (95% CI = [−0.034, −0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side-result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, thus emphasizing the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, based on a high number of datasets and carefully designed, will be necessary in the future to evaluate further variants, implementations or parameters of random forests which may yield improved accuracy compared to the original version with default values.

Electronic supplementary material: The online version of this article (10.1186/s12859-018-2264-5) contains supplementary material, which is available to authorized users.

“…First of all, most are underpowered: with the exception of the study by de Souza et al [18] (N = 65 datasets) and that of Statnikov et al [17] (N = 22 datasets), they use between N = 2 and N = 11 datasets to compare the methods, too few to achieve reasonable power when comparing the performances of classification methods [9]. Following the sample size calculation approach outlined in Boulesteix et al [9], the required number of datasets to detect a difference in error rates between two methods of, say, 3% with a paired sample t-test at a significance level of 0.05 and a power of 80% is as high as N = 43 if the standard deviation (over the datasets) of the difference in error rates is 7%, a standard deviation common in this setting [9]. …”

Section: Motivating Example (mentioning)

confidence: 99%
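The sample size figure quoted in the Motivating Example statement can be checked with a short sketch. It assumes the common normal approximation for a two-sided paired test, N ≥ ((z₁₋α/₂ + z_power) · σ / Δ)²; whether the cited framework uses exactly this formula is an assumption here, and the function name `required_datasets` is hypothetical. Under this approximation, the inputs from the quote (Δ = 3%, σ = 7%, α = 0.05, power = 80%) give the quoted N = 43.

```python
from math import ceil
from statistics import NormalDist


def required_datasets(delta, sd, alpha=0.05, power=0.80):
    """Approximate number of datasets for a two-sided paired test.

    delta: difference in performance to detect (e.g. 0.03 for 3%).
    sd:    standard deviation, over datasets, of the paired differences.
    Uses the normal approximation, not an exact t-based calculation.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)          # quantile for target power
    return ceil(((z_alpha + z_power) * sd / delta) ** 2)


# Inputs taken from the quoted statement: 3% difference, 7% standard deviation.
n = required_datasets(delta=0.03, sd=0.07)
print(n)
```

An exact calculation based on the noncentral t distribution would give a slightly larger number, so the result is best read as a lower bound on the datasets needed.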

“…A simple example is that of sample size, an extensively researched question on the number of patients required in a clinical trial in order to make valid statistical claims on any result. Analogously, in benchmarking, in order to draw conclusions from real-data analysis beyond illustrative anecdotic statements, it is important to have considered an adequate number of datasets; see Boulesteix et al [9] for a discussion on the precise meaning of “an adequate number”. In the remainder of this paper, we discuss further concepts essential to formulating evidence-based statements in computational research using real datasets.…”

Section: Introduction (mentioning)

confidence: 99%

Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies

Boulesteix, Wilson, Hapfelmeier

2017

BMC Med Res Methodol

Self Cite

Background: The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly “evidence-based”. Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research.

Main message: In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of “evidence-based” statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments.

Conclusion: We suggest that benchmark studies, a method of assessment of statistical methods using real-world datasets, might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.

On the role of benchmarking data sets and simulations in method comparison studies

Friedrich, Friede

2023

Biometrical J

Method comparisons are essential to provide recommendations and guidance for applied researchers, who often have to choose from a plethora of available approaches. While many comparisons exist in the literature, these are often not neutral but favor a novel method. Apart from the choice of design and a proper reporting of the findings, there are different approaches concerning the underlying data for such method comparison studies. Most manuscripts on statistical methodology rely on simulation studies and provide a single real‐world data set as an example to motivate and illustrate the methodology investigated. In the context of supervised learning, in contrast, methods are often evaluated using so‐called benchmarking data sets, that is, real‐world data that serve as gold standard in the community. Simulation studies, on the other hand, are much less common in this context. The aim of this paper is to investigate differences and similarities between these approaches, to discuss their advantages and disadvantages, and ultimately to develop new approaches to the evaluation of methods picking the best of both worlds. To this aim, we borrow ideas from different contexts such as mixed methods research and Clinical Scenario Evaluation.
