2016
DOI: 10.3109/02699206.2016.1174306

Finding the experts in the crowd: Validity and reliability of crowdsourced measures of children’s gradient speech contrasts

Abstract: Perceptual ratings aggregated across multiple non-expert listeners can be used to measure covert contrast in child speech. Online crowdsourcing provides access to a large pool of raters, but for practical purposes, researchers may wish to use smaller samples. The ratings obtained from these smaller samples may not maintain the high levels of validity seen in larger samples. This study aims to measure the validity and reliability of crowdsourced continuous ratings of child speech, obtained through Visual Analog…
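A minimal sketch of the general approach summarised in the abstract, using entirely hypothetical data (the listener counts, token counts, and VAS values below are assumptions, not figures from the study): continuous ratings from several non-expert listeners are averaged per token, and the averaged ratings for a child's /r/ and /w/ targets are compared to test for a covert contrast that binary transcription would miss.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical VAS ratings (0 = clear "w", 1 = clear "r") from 9 non-expert
# listeners for 20 of a child's /r/ targets and 20 /w/ targets, all of which
# a transcriber might label as [w].
r_targets = np.clip(rng.normal(0.35, 0.10, (9, 20)), 0, 1)
w_targets = np.clip(rng.normal(0.20, 0.10, (9, 20)), 0, 1)

# Aggregate across listeners, then ask whether the intended categories differ.
r_mean = r_targets.mean(axis=0)   # one averaged rating per /r/ token
w_mean = w_targets.mean(axis=0)   # one averaged rating per /w/ token
t, p = stats.ttest_ind(r_mean, w_mean)
print(f"/r/ targets: {r_mean.mean():.2f}  /w/ targets: {w_mean.mean():.2f}  "
      f"t = {t:.2f}, p = {p:.3g}")
```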

Cited by 19 publications (21 citation statements)
References: 39 publications
“…Validity relative to an external acoustic gold standard was found to be highly correlated with intra-rater reliability across repeated presentations, suggesting that reliability across repeated ratings could represent a useful method to screen raters. However, even when applying a relatively stringent standard and including only the 60 most reliable out of 120 raters, Harel et al. (2016) continued to find considerable variability in bootstrapped resamples of n = 9 listeners. Because AMT raters need to be compensated for their work even in the context of a pre-screening test, it was judged impractical to apply an even stricter standard to select only raters with truly expert-like performance.…”
Section: Discussion
confidence: 99%
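The variability described for bootstrapped resamples of n = 9 listeners can be illustrated with a short simulation. This is not analysis code from Harel et al. (2016); it assumes a hypothetical raters-by-tokens matrix of VAS scores and a hypothetical acoustic gold standard, and simply shows how the validity of a 9-listener average fluctuates across bootstrap resamples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 60 screened raters x 50 tokens of VAS scores in [0, 1],
# plus an assumed acoustic gold-standard value for each token.
n_raters, n_tokens = 60, 50
gold = rng.uniform(0, 1, n_tokens)
ratings = np.clip(gold + rng.normal(0, 0.25, (n_raters, n_tokens)), 0, 1)

def subset_validity(ratings, gold, k=9, n_boot=1000, rng=rng):
    """Correlation between the mean VAS rating of k resampled listeners
    and the acoustic gold standard, for each bootstrap resample."""
    corrs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.choice(ratings.shape[0], size=k, replace=True)
        mean_rating = ratings[idx].mean(axis=0)
        corrs[b] = np.corrcoef(mean_rating, gold)[0, 1]
    return corrs

corrs = subset_validity(ratings, gold)
print(f"validity of 9-listener subsets: median r = {np.median(corrs):.2f}, "
      f"2.5-97.5 percentile range = {np.percentile(corrs, [2.5, 97.5]).round(2)}")
```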
“…If there were an effective method to identify other crowdsourced listeners who exhibit expert- or near-expert-level performance, VAS ratings from as few as three listeners might yield valid information about the gradient properties of speech tokens. Recent work by Harel et al. (2016) investigated this question of “finding the experts in the crowd.” The performance of individual AMT raters was assessed by measuring both reliability and validity of VAS ratings obtained when a small set of tokens was presented for repeated rating. Validity relative to an external acoustic gold standard was found to be highly correlated with intra-rater reliability across repeated presentations, suggesting that reliability across repeated ratings could represent a useful method to screen raters.…”
Section: Discussion
confidence: 99%
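The screening logic summarised above can be sketched as follows. The arrays and per-rater noise levels are assumptions, not data from the cited work; for each rater the sketch computes validity (correlation of ratings with an acoustic gold standard) and intra-rater reliability (correlation between two presentations of the same tokens), then checks whether reliability tracks validity across raters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: each of 120 raters rates the same 30 tokens twice on a VAS.
n_raters, n_tokens = 120, 30
gold = rng.uniform(0, 1, n_tokens)
skill = rng.uniform(0.1, 0.4, n_raters)            # assumed per-rater noise level
pass1 = np.clip(gold + rng.normal(0, skill[:, None], (n_raters, n_tokens)), 0, 1)
pass2 = np.clip(gold + rng.normal(0, skill[:, None], (n_raters, n_tokens)), 0, 1)

# Per-rater validity: correlation of averaged ratings with the acoustic gold standard.
validity = np.array([np.corrcoef((p1 + p2) / 2, gold)[0, 1]
                     for p1, p2 in zip(pass1, pass2)])

# Per-rater reliability: correlation between the two presentations of the same tokens.
reliability = np.array([np.corrcoef(p1, p2)[0, 1] for p1, p2 in zip(pass1, pass2)])

# If reliability is a useful screening proxy, it should track validity across raters.
print(f"r(reliability, validity) = {np.corrcoef(reliability, validity)[0, 1]:.2f}")
```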
“…Blinded expert ratings were not available for within-treatment productions due to the large number of trials (N = 4,800) elicited in this context. However, Harel, Hitchcock, Szeredi, Ortiz, and McAllister Byun (2016) found that reliability measures estimated from a small subset of data were highly correlated with the same measure as derived from a larger subset of the data from the same individuals. Such findings suggest that one set of data can be a useful predictor of future ratings and support the possibility that the high level of agreement observed to hold between the clinician and the blinded raters on baseline and maintenance probes might also hold for within-session data.…”
Section: Measurement
confidence: 91%
“…To increase the reliability of these ratings, prior to completing experimental rating blocks, listeners were required to pass an eligibility-testing block measuring the reliability with which they used the VAS to rate speech tokens. Following a protocol developed in previous research on the use of repeated ratings to identify high-performing individuals among crowdsourced listeners, 39 a sample of 30 tokens was repeated four times in random order, totaling 120 tokens. Raters were excluded if their intraclass correlation coefficient across the repeated ratings was lower than 0.8.…”
Section: Deanna Kawitzky and Tara McAllister
confidence: 99%
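A minimal sketch of such an eligibility test, assuming a one-way single-measure ICC over the 30 tokens × 4 presentations (the precise ICC variant and the simulated raters below are assumptions, not details given in the excerpt):

```python
import numpy as np

def icc_1_1(ratings):
    """One-way, single-measure ICC for a single rater.
    `ratings` is an (n_tokens, n_reps) array of repeated VAS ratings."""
    n, k = ratings.shape
    token_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    ms_between = k * np.sum((token_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - token_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(2)

# Hypothetical eligibility block: 30 tokens, each presented 4 times (120 trials).
true_scores = rng.uniform(0, 1, 30)
reliable_rater = np.clip(true_scores[:, None] + rng.normal(0, 0.05, (30, 4)), 0, 1)
noisy_rater = np.clip(true_scores[:, None] + rng.normal(0, 0.40, (30, 4)), 0, 1)

for name, data in [("reliable", reliable_rater), ("noisy", noisy_rater)]:
    icc = icc_1_1(data)
    status = "included" if icc >= 0.8 else "excluded"
    print(f"{name} rater: ICC = {icc:.2f} -> {status}")
```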