2013
DOI: 10.1016/j.websem.2013.05.005

Repeatable and reliable semantic search evaluation

Abstract: An increasing amount of structured data on the Web has attracted industry attention and renewed research interest in what is collectively referred to as semantic search. These solutions exploit the explicit semantics captured in structured data such as RDF for enhancing document representation and retrieval, or for finding answers by directly searching over the data. These data have been used for different tasks and a wide range of corresponding semantic search solutions have been proposed in the past. However…

Cited by 28 publications (17 citation statements)
References 42 publications
“…On one hand, Alonso and Mizzaro (2009) showed that crowdsourcing was a reliable way of providing relevance assessments, the same conclusion reached by a more recent study by Carvalho et al (2011). On the other hand, Clough et al (2012) and Blanco et al (2013) showed that, while crowdsourced assessments and expert judges' assessments produce similar rankings of evaluated systems, they do not produce the same assessment scores. Blanco et al (2013) found that, in contrast to experts, who are pessimistic in their scoring, non-expert judges accept more items as relevant.…”
Section: Repeatability and Reliability
confidence: 77%
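To make the distinction in this statement concrete, the sketch below illustrates how "similar rankings but different scores" is typically quantified: rank agreement between expert and crowdsourced assessments measured with Kendall's tau, alongside the gap in absolute scores. This is a minimal illustration only; the system names and score values are invented, not data from Blanco et al (2013).

```python
# Hypothetical sketch: rank agreement vs. score agreement between judge groups.
# All system names and scores below are invented for illustration.
from scipy.stats import kendalltau

# Mean relevance scores assigned to five hypothetical systems by each judge group.
expert_scores = {"sysA": 0.42, "sysB": 0.35, "sysC": 0.51, "sysD": 0.28, "sysE": 0.47}
worker_scores = {"sysA": 0.58, "sysB": 0.49, "sysC": 0.66, "sysD": 0.41, "sysE": 0.63}

systems = sorted(expert_scores)
expert = [expert_scores[s] for s in systems]
worker = [worker_scores[s] for s in systems]

# Rank agreement: tau close to 1 means both groups order the systems alike.
tau, p_value = kendalltau(expert, worker)

# Score gap: a positive mean difference means workers are more lenient overall.
mean_gap = sum(w - e for w, e in zip(worker, expert)) / len(systems)

print(f"Kendall's tau between rankings: {tau:.2f} (p={p_value:.3f})")
print(f"Mean score difference (worker - expert): {mean_gap:.2f}")
```

In this toy data the two groups rank the systems identically (tau = 1.0) even though the crowd workers' scores are uniformly higher, which is the pattern the quoted studies describe.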
“…It is indeed important to understand how this factor affects the reliability of an evaluation's results, since it has been acknowledged in the literature that the more knowledge of and familiarity with the subject area the judges have, the less leniency they show in accepting documents as relevant (Rees and Schultz, 1967; Cuadra, 1967; Katter, 1968). Interestingly, Blanco et al (2013) analysed the impact of this factor on the reliability of the SemSearch evaluations and concluded that 1) experts are more pessimistic in their scoring and thus accept fewer items as relevant than workers do (which agrees with the previous studies), and 2) crowdsourced judgements therefore cannot replace expert evaluations.…”
Section: Relevance Judgements
confidence: 99%