2015
DOI: 10.1080/00031305.2015.1005128

A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies

Abstract: In computational sciences, including computational statistics, machine learning, and bioinformatics, most abstracts of articles presenting new supervised learning methods end with a sentence like "our method performed better than existing methods on real data sets", e.g. in terms of error rate. However, these claims are often not based on proper statistical tests and, if such tests are performed (as usual in the machine learning literature), the tested hypothesis is not clearly defined and poor attention is de…

Cited by 30 publications (35 citation statements); citation types: 2 supporting, 32 mentioning, 0 contrasting.
References 33 publications.
“…The superiority of RF tends to be more pronounced for increasing p and . More generally, our study outlines the importance of inclusion criteria and the necessity to include a large number of datasets in benchmark studies as outlined in previous literature [11, 28, 31]. …”
Section: Discussion (supporting)
confidence: 58%
“…Considering the M × 2 matrix, collecting the performance measures for the two investigated methods (LR and RF) on the M considered datasets, one can perform a test for paired samples to compare the performances of the two methods [31]. We refer to the previously published statistical framework [31] for a precise mathematical definition of the tested null hypothesis in the case of the t-test for paired samples.…”
Section: Methods (mentioning)
confidence: 99%
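The paired-samples test described in this statement can be illustrated in a few lines of code. The sketch below is a minimal example of my own, not code from the citing study or the cited framework: the number of datasets, the error-rate ranges, the random seed, and the synthetic data attached to the method labels (LR, RF) are all placeholder assumptions.

```python
# Minimal sketch of a paired t-test over M benchmark datasets (synthetic data,
# not results from the cited studies). Each position holds the error rate of
# two methods, here labelled "LR" and "RF", on one dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)               # placeholder seed
M = 43                                       # assumed number of benchmark datasets
err_lr = rng.uniform(0.15, 0.35, M)          # placeholder error rates, method 1
err_rf = np.clip(err_lr - rng.normal(0.03, 0.07, M), 0.0, 1.0)  # method 2

# Paired t-test on the per-dataset differences; the null hypothesis is that
# the mean difference in error rate across datasets is zero.
t_stat, p_value = stats.ttest_rel(err_lr, err_rf)
print(f"mean difference = {np.mean(err_lr - err_rf):.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")
```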
“…First of all, most are underpowered: with the exception of the study by de Souza et al [18] (N = 65 datasets) and that of Statnikov et al [17] (N = 22 datasets), they use between N = 2 and N = 11 datasets to compare the methods, too few to achieve reasonable power when comparing the performances of classification methods [9]. Following the sample size calculation approach outlined in Boulesteix et al [9], the required number of datasets to detect a difference in error rates between two methods of, say, 3% with a paired sample t-test at a significance level of 0.05 and a power of 80% is as high as N = 43 if the standard deviation (over the datasets) of the difference in error rates is 7%—a standard deviation common in this setting [9]. …”
Section: Motivating Example (mentioning)
confidence: 99%
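The figure of N = 43 quoted above can be checked, approximately, with the standard normal-approximation sample-size formula for a paired test. The sketch below is my own back-of-envelope reproduction, not code from the cited papers; the choice of the normal approximation is an assumption, and an exact t-based power calculation gives a slightly larger N.

```python
# Back-of-envelope sample-size check for a paired t-test, using the
# normal-approximation formula N = ((z_{1-alpha/2} + z_{power}) * sd / delta)^2.
# delta, sd, alpha, and power are taken from the citation statement above.
import math
from scipy.stats import norm

delta, sd, alpha, power = 0.03, 0.07, 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
z_beta = norm.ppf(power)            # ~0.84
n = ((z_alpha + z_beta) * sd / delta) ** 2
print(math.ceil(n))                 # -> 43 datasets
```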
“…A simple example is that of sample size, an extensively researched question on the number of patients required in a clinical trial in order to make valid statistical claims on any result. Analogously, in benchmarking, in order to draw conclusions from real-data analysis beyond illustrative anecdotal statements, it is important to have considered an adequate number of datasets; see Boulesteix et al [9] for a discussion on the precise meaning of “an adequate number”. In the remainder of this paper, we discuss further concepts essential to formulating evidence-based statements in computational research using real datasets.…”
Section: Introduction (mentioning)
confidence: 99%