2019
DOI: 10.1007/978-3-030-27947-9_30
A Comparison of Hybrid and End-to-End Models for Syllable Recognition

Abstract: This paper presents a comparison of a traditional hybrid speech recognition system (Kaldi, using WFSTs and a TDNN trained with lattice-free MMI) and a lexicon-free end-to-end model (a TensorFlow implementation of a multi-layer LSTM with CTC training) for German syllable recognition on the Verbmobil corpus. The results show that explicitly modeling prior knowledge is still valuable in building recognition systems. With a strong language model (LM) based on syllables, the structured approach significantly outperforms the end-to…

Cited by 4 publications (3 citation statements)
References 23 publications (21 reference statements)
“…For improved speech recognition accuracy, DNN-HMM methods are best suited for languages with limited annotated speech [14]. Also, when much more text data is available than speech data, DNN-HMM models are preferred over modern E2E approaches [15,16]. Additionally, DNN-HMM ASR models offer the advantage of easy integration into small hardware devices, enabling fast on-device speech recognition [14].…”
Section: Baby Elephant Compound Word Formed By Agglutination Of Nouns...
confidence: 99%
“…The network contains 12 TDNNF layers of dimension 1024 as hidden layers, each comprising one TDNN layer of dimension 1024 and two bottleneck layers of dimension 128. The input features are 40-dimensional Mel-frequency cepstral coefficients (MFCC) with Cepstral Mean and Variance Normalisation (CMVN), following Bayerl and Riedhammer (2019) as well as classical Kaldi recipes. Adult and TL models are trained for 990 and 89 epochs respectively, both with a learning rate of 5e-4 and an L2 regularisation rate of 1e-2.…”
Section: TDNNF-HMM: The Baseline
confidence: 99%
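Both citing works use 40-dimensional MFCC features with CMVN, as in the cited paper. A minimal sketch of per-utterance CMVN in NumPy (the function name and epsilon are illustrative, not taken from either paper's code):

```python
import numpy as np

def cmvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-utterance cepstral mean and variance normalisation.

    features: (num_frames, num_dims) array, e.g. 40-dim MFCCs.
    Returns features normalised to zero mean and unit variance
    per feature dimension, computed over the utterance's frames.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# Example: 100 frames of 40-dimensional features with an offset and scale
mfcc = np.random.randn(100, 40) * 3.0 + 5.0
norm = cmvn(mfcc)
```

In Kaldi recipes, CMVN statistics are typically computed per speaker rather than per utterance; the per-utterance form above is just the simplest variant.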
“…We use, in this work, an end-to-end CTC architecture, named RNN-CTC and shown in Figure 1, which is composed of a simple encoder with recurrent neural networks. The input features are 40-dimensional MFCC with CMVN, following Bayerl and Riedhammer (2019). The RNNs are composed of Bidirectional Gated Recurrent Unit (BiGRU) layers (Chung et al., 2015).…”
Section: RNN-CTC Model
confidence: 99%
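The CTC training used in both the original paper and this citing work maps per-frame network outputs to a label sequence by merging consecutive repeats and removing blank symbols. A minimal sketch of CTC greedy (best-path) decoding in plain Python (the blank index and function name are illustrative conventions, not from either paper's code):

```python
BLANK = 0  # CTC blank label index (a common convention; illustrative)

def ctc_greedy_decode(frame_ids):
    """Collapse a per-frame argmax label sequence into an output sequence:
    first merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in frame_ids:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

# Frames [a a - b b - - b] collapse to [a, b, b]:
# the blank between the last two b's keeps them distinct.
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 0, 2]))  # [1, 2, 2]
```

This is only the decoding side; during training, the CTC loss marginalises over all frame alignments that collapse to the reference syllable sequence.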