Interspeech 2020
DOI: 10.21437/interspeech.2020-1750
Do End-to-End Speech Recognition Models Care About Context?

Abstract: The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and evaluate how the models perform when we constrain the amount of contextual information in the audio input. We find that the AED model is indeed more context sensitive, but that the gap can b…
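The CTC paradigm named in the abstract scores a transcript by summing over all frame-level alignments that collapse to it (repeats merged, blanks removed), which is what makes it less dependent on output context than an AED decoder. As a minimal, stdlib-only sketch (not the paper's code), the standard CTC forward recursion over a blank-extended target looks like this; the uniform two-frame example at the end is a made-up toy, not data from the paper:

```python
import math

NEG_INF = float("-inf")

def logadd(a, b):
    """log(exp(a) + exp(b)) without underflow."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_log_prob(log_probs, target, blank=0):
    """CTC forward algorithm.

    log_probs: list over frames; log_probs[t][k] is the log-probability
               of symbol k at frame t.
    target:    list of label indices (without blanks).
    Returns log P(target | log_probs), summed over all alignments.
    """
    # Blank-extended target: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    S, T = len(ext), len(log_probs)

    # alpha[s] = log-prob of all alignment prefixes ending at ext[s]
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                      # stay on the same symbol
            if s > 0:
                a = logadd(a, alpha[s - 1])   # advance by one
            # skip a blank, unless that would merge repeated labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or the final blank.
    result = alpha[S - 1]
    if S > 1:
        result = logadd(result, alpha[S - 2])
    return result

# Toy check: 2 frames, uniform over {blank, 'a'}; the alignments
# "aa", "a-", "-a" all collapse to "a", so P = 3 * 0.25 = 0.75.
frames = [[math.log(0.5), math.log(0.5)]] * 2
print(math.exp(ctc_log_prob(frames, [1])))  # → 0.75
```

An AED model, by contrast, conditions each output token on all previously emitted tokens, which is the mechanism behind the implicit-language-model argument the abstract tests.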

Cited by 9 publications (6 citation statements)
References 30 publications
“…The performance of CTC significantly improved by using Conformer (B1, C1), and Mask-CTC greatly benefited from it (C2). The errors were further reduced by applying DLP (C3), achieving 9.1% on eval92, which was the best among the NAR models and better than that of the state-of-the-art model without an LM [46,47]. By comparing results between NAR and AR models, Mask-CTC achieved highly competitive performance to AR models for both Transformer (A1, B3) and Conformer (A3, C3), demonstrating the effectiveness of the proposed methods for improving the original Mask-CTC.…”
Section: Results
Confidence: 99%
“…ASR is a deep neural network using a model based on Connectionist Temporal Classification. 28 To train the ASR model in the Swedish language, a total of 45 h of uniformly randomly selected Swedish emergency calls from 2015, concerning all types of emergencies, were manually transcribed into written text. All text files were then used as labels for the ASR model to be trained in understanding the Swedish language.…”
Section: Automatic Speech Recognition
Confidence: 99%
“…For experiments with fine-tuning, we use language-specific BERT models 11 for German (Chan et al, 2020), Spanish (Canete et al, 2020), Dutch (de Vries et al, 2019), Finnish (Virtanen et al, 2019), Danish, 12 Croatian (Ulčar and Robnik-Šikonja, 2020), while we use mBERT (Devlin et al, 2019) for Afrikaans.…”
Section: Low Resource Setting
Confidence: 99%