Interspeech 2021
DOI: 10.21437/interspeech.2021-2203

Confidence Intervals for ASR-Based TTS Evaluation

Abstract: Automatic speech recognition (ASR) is increasingly used to evaluate the intelligibility of text-to-speech synthesis (TTS). ASR is less costly than traditional listening tests, but questions remain about its reliability. We re-evaluate the Blizzard Challenge's intelligibility tasks in English since 2011 using ASR. Re-analysing transcriptions collected by paid in-lab participants, online volunteers and Amazon Mechanical Turkers (the latter used only in 2011), we compare their word error rates (WERs) and statisti…

Cited by 7 publications (3 citation statements)
References 24 publications
“…To evaluate the intelligibility of synthesized speech signals, we use the word error rate (WER) from Amazon speech recognition system. As shown in [19], ASR-based metrics, e.g. WER, can perform reliably for evaluating intelligibility of TTS systems, and indeed on a comparable level to the paid human annotators/listeners.…”
Section: Intelligibility (mentioning)
confidence: 86%
“…To prevent this, we use the intelligibility metric to measure the quality of synthetic speaker profiles and synthesized speech signals. Similar to [19], we use the word error rate (WER) of a given speech recognition model as the intelligibility: when the WER is smaller, it means the intelligibility of synthesized speech signals is better.…”
Section: Experimental Evaluation (mentioning)
confidence: 99%
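
For readers unfamiliar with the metric these statements rely on, the following is a minimal sketch of how a WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference text and an ASR transcript, divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration; neither the paper nor the citing works specify their text normalization here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalization (lowercasing, whitespace tokenization) is an assumption
    for this sketch, not the procedure used in the cited papers.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A lower value indicates that the recognizer recovered more of the intended text, which is the sense in which the statement above treats a smaller WER as better intelligibility. Because insertions count as errors, the value can exceed 1.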
“…To evaluate the TTS systems, we analysed the objective measures of word error rate (WER) and a speaker-encoder-based cosine similarity over the entire Obama test set. These two measures have recently been found to have a high correlation to the perceptual measures obtained in listening tests [29,30]. The WERs are extracted based on the automatic transcripts provided by the SpeechBrain ASR system [22], while the cosine similarity uses SpeechBrain's speaker embedding network.…”
Section: Objective Evaluation of the TTS Models (mentioning)
confidence: 99%