2016
DOI: 10.3109/02699206.2016.1174306

Finding the experts in the crowd: Validity and reliability of crowdsourced measures of children’s gradient speech contrasts

Abstract: Perceptual ratings aggregated across multiple non-expert listeners can be used to measure covert contrast in child speech. Online crowdsourcing provides access to a large pool of raters, but for practical purposes, researchers may wish to use smaller samples. The ratings obtained from these smaller samples may not maintain the high levels of validity seen in larger samples. This study aims to measure the validity and reliability of crowdsourced continuous ratings of child speech, obtained through Visual Analog…
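A minimal sketch of the general approach summarised in the abstract, using entirely hypothetical data (the listener counts, token counts, and VAS values below are assumptions, not figures from the study): continuous ratings from several non-expert listeners are averaged per token, and the averaged ratings for a child's /r/ and /w/ targets are compared to test for a covert contrast that binary transcription would miss.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical VAS ratings (0 = clear "w", 1 = clear "r") from 9 non-expert
# listeners for 20 of a child's /r/ targets and 20 /w/ targets, all of which
# a transcriber might label as [w].
r_targets = np.clip(rng.normal(0.35, 0.10, (9, 20)), 0, 1)
w_targets = np.clip(rng.normal(0.20, 0.10, (9, 20)), 0, 1)

# Aggregate across listeners, then ask whether the intended categories differ.
r_mean = r_targets.mean(axis=0)   # one averaged rating per /r/ token
w_mean = w_targets.mean(axis=0)   # one averaged rating per /w/ token
t, p = stats.ttest_ind(r_mean, w_mean)
print(f"/r/ targets: {r_mean.mean():.2f}  /w/ targets: {w_mean.mean():.2f}  "
      f"t = {t:.2f}, p = {p:.3g}")
```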

Cited by 19 publications (21 citation statements)
References: 39 publications
“…Validity relative to an external acoustic gold standard was found to be highly correlated with intra-rater reliability across repeated presentations, suggesting that reliability across repeated ratings could represent a useful method to screen raters. However, even when applying a relatively stringent standard and including only the 60 most reliable out of 120 raters, Harel et al. (2016) continued to find considerable variability in bootstrapped resamples of n = 9 listeners. Because AMT raters need to be compensated for their work even in the context of a pre-screening test, it was judged impractical to apply an even stricter standard to select only raters with truly expert-like performance.…”
Section: Discussion
confidence: 99%
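The variability described for bootstrapped resamples of n = 9 listeners can be illustrated with a short simulation. This is not analysis code from Harel et al. (2016); it assumes a hypothetical raters-by-tokens matrix of VAS scores and a hypothetical acoustic gold standard, and simply shows how the validity of a 9-listener average fluctuates across bootstrap resamples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 60 screened raters x 50 tokens of VAS scores in [0, 1],
# plus an assumed acoustic gold-standard value for each token.
n_raters, n_tokens = 60, 50
gold = rng.uniform(0, 1, n_tokens)
ratings = np.clip(gold + rng.normal(0, 0.25, (n_raters, n_tokens)), 0, 1)

def subset_validity(ratings, gold, k=9, n_boot=1000, rng=rng):
    """Correlation between the mean VAS rating of k resampled listeners
    and the acoustic gold standard, for each bootstrap resample."""
    corrs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.choice(ratings.shape[0], size=k, replace=True)
        mean_rating = ratings[idx].mean(axis=0)
        corrs[b] = np.corrcoef(mean_rating, gold)[0, 1]
    return corrs

corrs = subset_validity(ratings, gold)
print(f"validity of 9-listener subsets: median r = {np.median(corrs):.2f}, "
      f"2.5-97.5 percentile range = {np.percentile(corrs, [2.5, 97.5]).round(2)}")
```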
“…If there were an effective method to identify other crowdsourced listeners who exhibit expert- or near-expert-level performance, VAS ratings from as few as three listeners might yield valid information about the gradient properties of speech tokens. Recent work by Harel et al. (2016) investigated this question of “finding the experts in the crowd.” The performance of individual AMT raters was assessed by measuring both reliability and validity of VAS ratings obtained when a small set of tokens was presented for repeated rating. Validity relative to an external acoustic gold standard was found to be highly correlated with intra-rater reliability across repeated presentations, suggesting that reliability across repeated ratings could represent a useful method to screen raters.…”
Section: Discussion
confidence: 99%
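The screening logic summarised above can be sketched as follows. The arrays and per-rater noise levels are assumptions, not data from the cited work; for each rater the sketch computes validity (correlation of ratings with an acoustic gold standard) and intra-rater reliability (correlation between two presentations of the same tokens), then checks whether reliability tracks validity across raters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: each of 120 raters rates the same 30 tokens twice on a VAS.
n_raters, n_tokens = 120, 30
gold = rng.uniform(0, 1, n_tokens)
skill = rng.uniform(0.1, 0.4, n_raters)            # assumed per-rater noise level
pass1 = np.clip(gold + rng.normal(0, skill[:, None], (n_raters, n_tokens)), 0, 1)
pass2 = np.clip(gold + rng.normal(0, skill[:, None], (n_raters, n_tokens)), 0, 1)

# Per-rater validity: correlation of averaged ratings with the acoustic gold standard.
validity = np.array([np.corrcoef((p1 + p2) / 2, gold)[0, 1]
                     for p1, p2 in zip(pass1, pass2)])

# Per-rater reliability: correlation between the two presentations of the same tokens.
reliability = np.array([np.corrcoef(p1, p2)[0, 1] for p1, p2 in zip(pass1, pass2)])

# If reliability is a useful screening proxy, it should track validity across raters.
print(f"r(reliability, validity) = {np.corrcoef(reliability, validity)[0, 1]:.2f}")
```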
“…Blinded expert ratings were not available for within-treatment productions due to the large number of trials (N = 4,800) elicited in this context. However, Harel, Hitchcock, Szeredi, Ortiz, and McAllister Byun (2016) found that reliability measures estimated from a small subset of data were highly correlated with the same measure as derived from a larger subset of the data from the same individuals. Such findings suggest that one set of data can be a useful predictor of future ratings and support the possibility that the high level of agreement observed to hold between the clinician and the blinded raters on baseline and maintenance probes might also hold for within-session data.…”
Section: Measurement
confidence: 91%
“…To increase the reliability of these ratings, prior to completing experimental rating blocks, listeners were required to pass an eligibility-testing block measuring the reliability with which they used the VAS to rate speech tokens. Following a protocol developed in previous research on the use of repeated ratings to identify high-performing individuals among crowdsourced listeners, 39 a sample of 30 tokens was repeated four times in random order, totaling 120 tokens. Raters were excluded if their intraclass correlation coefficient across the repeated ratings was lower than 0.8.…”
Section: Deanna Kawitzky and Tara McAllister
confidence: 99%
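A minimal sketch of such an eligibility test, assuming a one-way single-measure ICC over the 30 tokens × 4 presentations (the precise ICC variant and the simulated raters below are assumptions, not details given in the excerpt):

```python
import numpy as np

def icc_1_1(ratings):
    """One-way, single-measure ICC for a single rater.
    `ratings` is an (n_tokens, n_reps) array of repeated VAS ratings."""
    n, k = ratings.shape
    token_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    ms_between = k * np.sum((token_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - token_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(2)

# Hypothetical eligibility block: 30 tokens, each presented 4 times (120 trials).
true_scores = rng.uniform(0, 1, 30)
reliable_rater = np.clip(true_scores[:, None] + rng.normal(0, 0.05, (30, 4)), 0, 1)
noisy_rater = np.clip(true_scores[:, None] + rng.normal(0, 0.40, (30, 4)), 0, 1)

for name, data in [("reliable", reliable_rater), ("noisy", noisy_rater)]:
    icc = icc_1_1(data)
    status = "included" if icc >= 0.8 else "excluded"
    print(f"{name} rater: ICC = {icc:.2f} -> {status}")
```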