This project evaluates the accuracy of YouTube's automatically generated captions across two genders and five dialects of English. Speakers' dialect and gender were controlled for by using videos uploaded as part of the "accent tag challenge", in which speakers explicitly identify their language background. The results show robust differences in accuracy across both gender and dialect, with lower accuracy for 1) women and 2) speakers from Scotland. This finding builds on earlier research showing that a speaker's sociolinguistic identity may negatively affect their ability to use automatic speech recognition, and it demonstrates the need for sociolinguistically stratified validation of such systems.
This project compares the accuracy of two automatic speech recognition (ASR) systems, Bing Speech and YouTube's automatic captions, across gender, race, and four dialects of American English. The dialects included were chosen for their acoustic dissimilarity. Bing Speech showed differences in word error rate (WER) between dialects and ethnicities, but they were not statistically reliable. YouTube's automatic captions, however, did have statistically different WERs across dialects and races, with the lowest average error rates for General American and white talkers, respectively. Neither system had a reliably different WER between genders, a difference that had previously been reported for YouTube's automatic captions [1]. However, the higher error rate for non-white talkers is worrying, as it may reduce the utility of these systems for talkers of color.
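Both studies quantify caption accuracy with word error rate. As background, the sketch below shows one standard way to compute WER from a reference transcript and an ASR hypothesis via word-level edit distance; the example strings are invented for illustration and are not drawn from either study's data.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, computed by dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Illustrative (invented) transcripts: one substitution over three reference words.
print(wer("say aunt again", "say ant again"))  # 1/3 ≈ 0.33

Per-group WERs (by gender, dialect, or race) can then be averaged and compared; whether the observed differences are statistically reliable depends on the test the authors applied, which is not specified in these abstracts.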
Purpose
The present study was designed to evaluate use of spectral and temporal cues, under conditions where both types of cues were available.
Method
Participants included adults with normal hearing and adults with hearing loss. We focused on three categories of speech cues: static spectral (spectral shape), dynamic spectral (formant change), and temporal (amplitude envelope). Spectral and/or temporal dimensions of synthetic speech were systematically manipulated along a continuum, and recognition was measured using the manipulated stimuli. Level was controlled to ensure cue audibility. Discriminant function analysis was used to determine the degree to which spectral and temporal information contributed to the identification of each stimulus.
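As a rough illustration of the analysis described above, the sketch below fits a linear discriminant model to a synthetic feature matrix whose columns stand in for the three cue dimensions (spectral shape, formant change, amplitude envelope). The data, labels, and use of scikit-learn's LinearDiscriminantAnalysis are assumptions made for illustration, not the study's actual procedure.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_trials = 200
# Hypothetical per-trial cue measurements (already standardized):
# columns = [spectral shape, formant change, amplitude envelope].
X = rng.standard_normal((n_trials, 3))
# Toy identification responses, driven mostly by the spectral-shape column.
labels = (X[:, 0] + 0.3 * X[:, 2] > 0).astype(int)

lda = LinearDiscriminantAnalysis()
lda.fit(X, labels)

# The discriminant coefficients indicate the relative weight of each cue
# dimension in separating the two response categories.
for name, coef in zip(["spectral shape", "formant change", "amplitude envelope"],
                      lda.coef_[0]):
    print(f"{name}: {coef:+.2f}")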
Results
Listeners with normal hearing were influenced to a greater extent by spectral cues for all stimuli. Listeners with hearing impairment generally utilized spectral cues when the information was static (spectral shape) but used temporal cues when the information was dynamic (formant transition). The relative use of spectral and temporal dimensions varied among individuals, especially among listeners with hearing loss.
Conclusions
Information about spectral and temporal cue use may aid in identifying listeners who rely to a greater extent on particular acoustic cues and in applying that information toward therapeutic interventions.