2012
DOI: 10.1145/2094072.2094076
Multiple testing in statistical analysis of systems-based information retrieval experiments

Abstract: High-quality reusable test collections and formal statistical hypothesis testing together support a rigorous experimental environment for information retrieval research. But as Armstrong et al. [2009b] recently argued, global analysis of experiments suggests that there has actually been little real improvement in ad hoc retrieval effectiveness over time. We investigate this phenomenon in the context of simultaneous testing of many hypotheses using a fixed set of data. We argue that the most common approaches t…

Cited by 119 publications (58 citation statements)
References 31 publications
“…If t tests are conducted multiple times, the familywise error rate (i.e., the probability of detecting at least one nonexistent between-system difference) amounts to 1 − (1 − α)^(m(m−1)/2), assuming that all of these tests are independent of one another.⁸ It is now known that the t test actually behaves very similarly to distribution-free, computer-based tests, namely the bootstrap (Sakai 2006) and randomisation (…) tests, even though historically IR researchers were cautious about the use of parametric tests (Jones and Willett 1997, p. 170; Van Rijsbergen 1979, p. 247). (Carterette 2012; Ellis 2010).⁹ In contrast, the present study computes the required topic set size n by considering both the t test (for m = 2) and one-way ANOVA (for m ≥ …”
Section: Statistical Power Analysis by Webber/Moffat/Zobel (mentioning)
confidence: 99%
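
The familywise error rate expression quoted above is easy to check numerically. The following is a minimal Python sketch, not code from the cited work; the function name and the choice of m = 10 systems are assumptions for the example.

```python
# Sketch: familywise error rate for all pairwise t tests among m systems,
# under the independence assumption quoted above.
# FWER = 1 - (1 - alpha)^(m*(m-1)/2)

def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across all
    m*(m-1)/2 independent pairwise tests at significance level alpha."""
    n_tests = m * (m - 1) // 2
    return 1.0 - (1.0 - alpha) ** n_tests

# With 10 systems there are 45 pairwise tests, and at alpha = 0.05 a
# spurious "significant" difference is almost guaranteed:
print(familywise_error_rate(10))  # ~0.90
```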
“…2.2, if the researcher is interested in the differences between every system pair, then conducting t tests multiple times is not the correct approach; an appropriate multiple comparison procedure (Boytsov et al. 2013; Carterette 2012; Nagata 1998) should be applied in order to avoid the aforementioned familywise error rate problem. However, there are also cases where applying the t test multiple times is the correct approach to take even when there are more than two systems (m > 2) (Nagata 1998).…”
Section: Theory (mentioning)
confidence: 99%
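
As a concrete instance of such a procedure, here is a short sketch of the Holm-Bonferroni step-down correction, one standard way to control the familywise error rate; this particular choice of procedure and the example p-values are illustrative assumptions, not the specific method of any paper cited above.

```python
# Sketch: Holm-Bonferroni step-down correction over the p-values from
# all pairwise system comparisons, controlling the familywise error rate.

def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is
    rejected with the familywise error rate controlled at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: all larger p-values also fail
    return reject

# Hypothetical p-values from six pairwise comparisons:
print(holm_bonferroni([0.001, 0.008, 0.020, 0.040, 0.045, 0.300]))
# [True, True, False, False, False, False]
```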
“…Also, we note that there are other measures besides the ones we study here, such as the d-rank distance (Carterette 2009) or variations of the rank correlations (Melucci 2007). Similarly, in this paper we focused on the F-test because we were interested in simultaneously comparing a set of systems, but there are other statistical tests that can be used to compare individual pairs of systems, such as the t-test, Wilcoxon, bootstrap or permutation tests (Hull 1993; Sakai 2006; Smucker et al. 2007; Urbano et al. 2013a), which can be further coupled with methods to adjust p-values for multiple comparisons (Carterette 2012; Boytsov et al. 2013). We leave these lines for further work as well, especially the study, via simulation, of the actual Type I and Type II error rates of various statistical significance tests.…”
Section: Discussion (mentioning)
confidence: 99%
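
To make the contrast concrete, the sketch below runs both kinds of test on synthetic per-topic scores (hypothetical data, not from any cited experiment): scipy's one-way ANOVA F-test compares all systems simultaneously, while a paired t-test compares a single pair. Note that f_oneway treats the groups as independent and ignores the pairing by topic, so this is a simplification rather than a full topic-effects design.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_topics = 50

# Hypothetical AP-like effectiveness scores in [0, 1] for three systems;
# system B is correlated with A and slightly better on average.
sys_a = rng.beta(2, 5, n_topics)
sys_b = np.clip(sys_a + rng.normal(0.05, 0.05, n_topics), 0.0, 1.0)
sys_c = rng.beta(2, 5, n_topics)

# Simultaneous comparison of all systems with a one-way ANOVA F-test:
f_stat, p_all = stats.f_oneway(sys_a, sys_b, sys_c)

# Pairwise comparison of two systems, paired by topic:
t_stat, p_pair = stats.ttest_rel(sys_a, sys_b)

print(f"F-test across all systems: F={f_stat:.2f}, p={p_all:.4f}")
print(f"Paired t-test, A vs. B:    t={t_stat:.2f}, p={p_pair:.4f}")
```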
“…This is particularly important for statistical measures, because they make a number of assumptions that are, by definition, not met in IR evaluation experiments (van Rijsbergen 1979; Hull 1993). The main reason is that effectiveness measures produce discrete values typically bounded by 0 and 1 (Carterette 2012). For instance, some measures of collection accuracy assume that score distributions are normally distributed¹; they are not, because they are bounded.…”
Section: Introduction (mentioning)
confidence: 99%
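
The boundedness point lends itself to a quick simulation. In the sketch below, scores are drawn from a Beta distribution purely as a stand-in for a bounded, skewed effectiveness measure (an assumption, not a claim about any measure's true distribution), and a Shapiro-Wilk test duly rejects normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in for per-topic effectiveness scores: bounded in [0, 1] and skewed.
scores = rng.beta(0.5, 2.0, 500)

# Shapiro-Wilk test of the null hypothesis that the sample is normal:
w_stat, p_value = stats.shapiro(scores)
print(f"Shapiro-Wilk: W={w_stat:.3f}, p={p_value:.2e}")  # tiny p: not normal
```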