2017
DOI: 10.1162/tacl_a_00074

Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets

Abstract: With the ever-growing amounts of textual data from a large variety of languages, domains and genres, it has become standard to evaluate NLP algorithms on multiple datasets in order to ensure consistent performance across heterogeneous setups. However, such multiple comparisons pose significant challenges to traditional statistical analysis methods in NLP and can lead to erroneous conclusions. In this paper we propose a Replicability Analysis framework for a statistically sound analysis of multiple comparisons …

Cited by 51 publications (40 citation statements)
References 49 publications

“…While this paper focuses on the correct choice of a significance test, we also checked whether the papers in our sample account for the effect of multiple hypothesis testing when testing statistical significance (see Dror et al., 2017). When testing multiple hypotheses, as in the case of comparing the participating algorithms across a large number of datasets, the probability of making one or more false claims may be very high, even if the probability of drawing an erroneous conclusion in each individual comparison is small.…”
Section: Statistical Test
confidence: 99%
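To make the quoted point concrete, here is a minimal sketch (not from the cited paper) of how the familywise error rate grows when N independent comparisons are each tested at a fixed per-comparison level α:

```python
# Minimal sketch, assuming independent comparisons each tested at level alpha;
# the probability of at least one false rejection is 1 - (1 - alpha)**N.
alpha = 0.05
for n in (1, 5, 10, 20, 50):
    fwer = 1 - (1 - alpha) ** n
    print(f"N = {n:2d} comparisons -> P(at least one false rejection) = {fwer:.3f}")
```

Already at N = 20 datasets the chance of at least one spurious "significant" result exceeds 64%, which is the motivation for the replicability analysis framework discussed above.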
“…We note that in this paper we do not deal with the problem of drawing valid conclusions from multiple comparisons between algorithms across a large number of datasets, a.k.a. replicability analysis (see Dror et al., 2017). Instead, our focus is on a single comparison: how can we make sure that the difference between the two algorithms, as observed in an individual comparison, is not coincidental?…”
Section: Introduction
confidence: 99%
“…We now repeat these analyses across twenty randomly generated 80%-10%-10% splits. Following Dror et al. (2017), we use the Bonferroni procedure to control the familywise error rate, the probability of falsely rejecting at least one true null hypothesis. This is appropriate insofar as each individual trial (i.e., evaluation on a random split) has a non-trivial statistical dependence on other trials.…”
Section: Experiments 2: Reproduction
confidence: 99%
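A minimal sketch of the Bonferroni rule referenced in this passage, with made-up p-values standing in for the per-split test results (the overall level α = 0.05 and the p-values themselves are assumptions for illustration; only the number of splits, twenty, follows the quote):

```python
# Illustrative only: one hypothetical p-value per random 80%-10%-10% split.
p_values = [0.004, 0.031, 0.0007, 0.019, 0.048, 0.002, 0.011, 0.026, 0.0009, 0.044,
            0.015, 0.038, 0.006, 0.021, 0.001, 0.033, 0.009, 0.027, 0.0004, 0.017]
alpha = 0.05                       # assumed familywise error rate target
threshold = alpha / len(p_values)  # Bonferroni-adjusted per-test level
n_rejected = sum(p < threshold for p in p_values)
print(f"per-test threshold {threshold:.4f}: reject {n_rejected} of {len(p_values)} nulls")
```

Under this correction only p-values below 0.0025 count as significant, which is what keeps the familywise error rate across all twenty splits at or below α.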
“…Since the distribution of our test data is unknown and the datasets are small, we perform a Wilcoxon signed-rank test for each hypothesis (Dror et al., 2018). Additionally, to counteract the multiple hypotheses problem, we apply the conservative Bonferroni correction, where the global null hypothesis is rejected if p < α/N, where N is the number of hypotheses (Dror et al., 2017). In our setting, α = 0.01 and N = 4 for EEG (one hypothesis per EEG data source), N = 59 for fMRI (one hypothesis per participant of each fMRI data source), and N = 42 for eye-tracking (one hypothesis per feature per eye-tracking corpus).…”
Section: Multiple Hypotheses Testing
confidence: 99%
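A minimal, self-contained sketch of the procedure quoted above, using synthetic paired scores in place of the real EEG evaluation results (α = 0.01 and N = 4 follow the quote; the data, sample size, and score distributions are invented for illustration):

```python
import numpy as np
from scipy.stats import wilcoxon  # Wilcoxon signed-rank test

rng = np.random.default_rng(0)
alpha, n_hypotheses = 0.01, 4            # EEG setting quoted above
bonferroni_level = alpha / n_hypotheses  # per-hypothesis rejection threshold

for source in range(n_hypotheses):
    # Synthetic paired scores for two systems on one EEG data source;
    # in the cited work these would be the actual evaluation scores.
    scores_a = rng.normal(0.75, 0.05, size=30)
    scores_b = scores_a + rng.normal(0.02, 0.03, size=30)
    stat, p = wilcoxon(scores_a, scores_b)
    print(f"data source {source}: p = {p:.4f}, "
          f"reject at alpha/N = {bonferroni_level}: {p < bonferroni_level}")
```

The same pattern applies to the fMRI and eye-tracking settings, with N = 59 and N = 42 respectively.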