ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414920
Confidence Estimation for Attention-Based Sequence-to-Sequence Models for Speech Recognition

Abstract: For various speech-related tasks, confidence scores from a speech recogniser are a useful measure to assess the quality of transcriptions. In traditional hidden Markov model-based automatic speech recognition (ASR) systems, confidence scores can be reliably obtained from word posteriors in decoding lattices. However, for an ASR system with an auto-regressive decoder, such as an attention-based sequence-to-sequence model, computing word posteriors is difficult. An obvious alternative is to use the decoder softma…

Cited by 34 publications (15 citation statements)
References 31 publications
“…[12]. The second one is based on the confidence estimation module (CEM) [10], an effective confidence method in ASR. In order to comply with the CEM input format, a two-layer GRU with 128 dimensions is added as the decoder, and the CEM module has one fully-connected layer with 256 units.…”
Section: Data Set and Model Configurations
confidence: 99%
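The configuration quoted above (a two-layer, 128-dimensional GRU added as the decoder, feeding a CEM with one 256-unit fully-connected layer) can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: the 80-dimensional input, the sigmoid head, and all module names are assumptions.

```python
import torch
import torch.nn as nn


class CEM(nn.Module):
    """Confidence estimation module sketch: decoder features -> per-token confidence.

    Layer sizes follow the quoted excerpt (two-layer GRU with 128 dimensions;
    one fully-connected layer with 256 units); the 80-dim input and the
    sigmoid output head are illustrative assumptions.
    """

    def __init__(self, input_size: int = 80):
        super().__init__()
        # Two-layer GRU with 128-dimensional hidden state, as in the excerpt.
        self.decoder = nn.GRU(input_size, 128, num_layers=2, batch_first=True)
        # One fully-connected layer with 256 units, then a scalar score per token.
        self.fc = nn.Linear(128, 256)
        self.out = nn.Linear(256, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.decoder(x)  # (batch, time, 128)
        c = torch.sigmoid(self.out(torch.relu(self.fc(h))))
        return c.squeeze(-1)  # (batch, time) confidences in [0, 1]


cem = CEM()
scores = cem(torch.randn(4, 10, 80))  # 4 utterances, 10 tokens each
print(scores.shape)  # torch.Size([4, 10])
```

Each time step produces one score in [0, 1], matching the token-level confidence format the excerpt refers to.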
“…For better performance, neural confidence estimation methods are drawing wide research interest to date. These works mainly focus on deriving an increasingly discriminating set of features for the binary classifier under a specific structure, such as attention-based sequence-to-sequence models [10] and RNN-T models [11,12]. However, these methods are highly sensitive to the model structure and the set of features extracted.…”
confidence: 99%
“…However, for end-to-end (E2E) ASR models such as recurrent neural network transducers (RNN-T) and attention-based sequence-to-sequence models, word posteriors cannot be approximated well from the tree-like "lattice", where the prediction of each token is conditioned on the full history of previous tokens. Autoregressive decoders also tend to be overconfident [24]. To solve this challenge, several model-based methods have been proposed to estimate word- and utterance-level confidence for E2E models.…”
Section: Introduction
confidence: 99%
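The overconfidence problem mentioned above arises when the decoder's softmax probability is used directly as a confidence score. A minimal pure-Python sketch of that naive baseline (the logit values are invented for illustration):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def naive_token_confidence(logits):
    """Naive confidence: the softmax probability of the predicted (top) token.

    Autoregressive decoders tend to make this score overconfident,
    which is what dedicated confidence estimation modules aim to correct.
    """
    return max(softmax(logits))


# A sharply peaked logit vector yields near-certain "confidence",
# regardless of whether the prediction is actually correct.
print(round(naive_token_confidence([8.0, 1.0, 0.5]), 3))  # → 0.999
```

The score saturates near 1.0 for almost any confidently decoded token, which is why it correlates poorly with actual correctness.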
“…To solve this challenge, several model-based methods have been proposed to estimate word- and utterance-level confidence for E2E models. For example, [25,24] proposed training a token-level (e.g. grapheme or word-piece [26]) confidence estimation module (CEM) on top of a given E2E model; word-level confidence can then be obtained simply by averaging the token-level scores.…”
Section: Introduction
confidence: 99%
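The averaging step described above (token-level scores pooled into word-level confidence) can be sketched in plain Python. The leading-`▁` word-boundary convention is an assumption borrowed from common SentencePiece-style wordpiece tokenizers, not a detail from the cited papers.

```python
def word_confidences(tokens, token_scores):
    """Average token-level confidence scores within each word.

    Assumes SentencePiece-style wordpieces, where a leading '▁'
    marks the start of a new word (an illustrative convention).
    """
    word_scores, current = [], []
    for tok, score in zip(tokens, token_scores):
        if tok.startswith("▁") and current:
            # A new word begins: flush the accumulated token scores.
            word_scores.append(sum(current) / len(current))
            current = []
        current.append(score)
    if current:
        word_scores.append(sum(current) / len(current))
    return word_scores


# "▁speech" is one word; "▁recog" + "niser" together form a second word.
scores = word_confidences(["▁speech", "▁recog", "niser"], [0.9, 0.8, 0.6])
print([round(s, 2) for s in scores])  # → [0.9, 0.7]
```

Averaging is only one pooling choice; taking the minimum token score per word is another common variant when a pessimistic word-level estimate is preferred.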