2019
DOI: 10.1007/978-3-030-27947-9_30
A Comparison of Hybrid and End-to-End Models for Syllable Recognition

Abstract: This paper presents a comparison of a traditional hybrid speech recognition system (Kaldi, using WFSTs and a TDNN trained with lattice-free MMI) and a lexicon-free end-to-end model (a TensorFlow implementation of a multi-layer LSTM with CTC training) for German syllable recognition on the Verbmobil corpus. The results show that explicitly modeling prior knowledge is still valuable in building recognition systems. With a strong language model (LM) based on syllables, the structured approach significantly outperforms the end-to…

Cited by 4 publications (3 citation statements)
References 23 publications (21 reference statements)
“…For improved speech recognition accuracy, DNN-HMM methods are best suited for languages with limited annotated speech [14]. Also, when much more text data is available than speech data, DNN-HMM models are preferred over modern E2E approaches [15,16]. Additionally, DNN-HMM ASR models offer the advantage of easy integration into small hardware devices, enabling fast on-device speech recognition [14].…”
Section: Baby Elephant Compound Word Formed By Agglutination Of Nouns...
confidence: 99%
“…The network contains 12 TDNNF layers of dimension 1024 as hidden layers, each comprising one TDNN layer of dimension 1024 and two bottleneck layers of dimension 128. The input features are 40-dimensional Mel-frequency cepstral coefficients (MFCC) with Cepstral Mean and Variance Normalisation (CMVN), following Bayerl and Riedhammer (2019) as well as classical Kaldi recipes. Adult and TL models are trained for 990 and 89 epochs respectively, both with a learning rate of 5e-4 and an L2 regularisation rate of 1e-2.…”
Section: TDNNF-HMM: The Baseline
confidence: 99%
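Both citing works use 40-dimensional MFCC features with CMVN, as in the cited paper. A minimal sketch of per-utterance CMVN in NumPy (the function name and epsilon are illustrative, not taken from either paper's code):

```python
import numpy as np

def cmvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-utterance cepstral mean and variance normalisation.

    features: (num_frames, num_dims) array, e.g. 40-dim MFCCs.
    Returns features normalised to zero mean and unit variance
    per feature dimension, computed over the utterance's frames.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# Example: 100 frames of 40-dimensional features with an offset and scale
mfcc = np.random.randn(100, 40) * 3.0 + 5.0
norm = cmvn(mfcc)
```

In Kaldi recipes, CMVN statistics are typically computed per speaker rather than per utterance; the per-utterance form above is just the simplest variant.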
“…We use, in this work, an end-to-end CTC architecture, named RNN-CTC and shown in Figure 1, which is composed of a simple encoder with recurrent neural networks. The input features are 40-dimensional MFCC with CMVN, following Bayerl and Riedhammer (2019). The RNNs are composed of Bidirectional Gated Recurrent Unit (BiGRU) layers (Chung et al., 2015).…”
Section: RNN-CTC Model
confidence: 99%
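The CTC training used in both the original paper and this citing work maps per-frame network outputs to a label sequence by merging consecutive repeats and removing blank symbols. A minimal sketch of CTC greedy (best-path) decoding in plain Python (the blank index and function name are illustrative conventions, not from either paper's code):

```python
BLANK = 0  # CTC blank label index (a common convention; illustrative)

def ctc_greedy_decode(frame_ids):
    """Collapse a per-frame argmax label sequence into an output sequence:
    first merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in frame_ids:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

# Frames [a a - b b - - b] collapse to [a, b, b]:
# the blank between the last two b's keeps them distinct.
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 0, 2]))  # [1, 2, 2]
```

This is only the decoding side; during training, the CTC loss marginalises over all frame alignments that collapse to the reference syllable sequence.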