2018
DOI: 10.15622/sp.58.3

Improvements in Serbian Speech Recognition Using Sequence-Trained Deep Neural Networks

Abstract: This paper presents the recent improvements in Serbian speech recognition that were obtained by using contemporary deep neural networks based on sequence-discriminative training to train robust acoustic models. More specifically, several variants of the new large vocabulary continuous speech recognition (LVCSR) system are described, all based on the lattice-free version of the maximum mutual information (LF-MMI) tr…

Cited by 11 publications (10 citation statements, citing publications from 2018 to 2023). References: 0 publications.
“…in [74] developed lattice-free maximum mutual information training using a phone n-gram language model, starting from randomly initialized neural networks. This method was also successfully applied to Serbian [75], where the relative WER reduction was about 25% with respect to the best previous system.…”
Section: Progress in Speech Recognition and Synthesis As Well As
Citation type: mentioning
confidence: 99%
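For context, the maximum mutual information criterion that LF-MMI optimizes can be written in its standard form as follows (a generic formulation, not quoted from the paper; κ denotes the acoustic scale, and in the lattice-free variant the denominator sum is evaluated over a phone n-gram graph rather than word lattices):

\mathcal{F}_{\mathrm{MMI}} = \sum_{u} \log \frac{p(O_u \mid W_u)^{\kappa}\, P(W_u)}{\sum_{W'} p(O_u \mid W')^{\kappa}\, P(W')}

Here O_u is the observation sequence of utterance u, W_u its reference transcription, and the denominator sum runs over competing hypotheses W'.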
“…The acoustic models used were subsampled time-delay neural networks (TDNNs), trained using cross-entropy training within the so-called “chain” training method [17]. For this purpose, the Kaldi speech recognition toolkit [18] was used.…”
Section: Methods
Citation type: mentioning
confidence: 99%
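As a rough illustration of the subsampled TDNNs referred to above, the following is a minimal PyTorch sketch, not the actual Kaldi implementation; the hidden size and depth are assumptions, the 129-dimensional input and 2000 output states follow the configuration described in the next statement, and the real “chain” models use the LF-MMI objective and a different topology.

import torch
import torch.nn as nn

class SubsampledTDNN(nn.Module):
    """Sketch of a TDNN: temporal context grows layer by layer via
    dilated 1-D convolutions; outputs are kept for every 3rd frame."""
    def __init__(self, feat_dim=129, hidden_dim=512, num_targets=2000):
        super().__init__()
        self.layers = nn.Sequential(
            # splice frames {-1, 0, +1} at the input frame rate
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, dilation=1),
            nn.ReLU(),
            # wider context: splice every 3rd frame (dilation 3)
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=3),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=3),
            nn.ReLU(),
        )
        self.output = nn.Conv1d(hidden_dim, num_targets, kernel_size=1)

    def forward(self, feats):
        # feats: (batch, feat_dim, num_frames)
        hidden = self.layers(feats)
        # subsampling factor of 3: emit scores for every 3rd frame only
        return self.output(hidden)[:, :, ::3]

# Example: one utterance with 129-dimensional features over 100 frames.
scores = SubsampledTDNN()(torch.randn(1, 129, 100))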
“…Alignments for the deep neural network (DNN) training were provided by a previously trained speaker-adaptive HMM-GMM (hidden Markov model – Gaussian mixture model) system [19] with 3500 states and 35,000 Gaussians. Acoustic features used for DNN training were 40 high-resolution MFCC features (Mel-frequency cepstral coefficients), alongside their first- and second-order derivatives, as well as 3 pitch-based features: weighted log-pitch, delta-log-pitch, and the warped normalized cross-correlation function (NCCF) value (originally between −1 and 1, and higher for voiced frames), together with their derivatives, producing a 129-dimensional feature vector; this configuration was already used in other experiments [5, 15, 17]. The context dependency tree used for the “chain” training, with its special model topology that allows a subsampling factor of 3, had 2000 leaves (output states).…”
Section: Methods
Citation type: mentioning
confidence: 99%
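The 129-dimensional figure quoted above follows directly from the listed components; a small sanity-check sketch (Python, for illustration only):

# 40 high-resolution MFCCs plus 3 pitch-based features, each kept as
# static coefficients plus first- and second-order derivatives.
mfcc_dim = 40
pitch_dim = 3            # weighted log-pitch, delta-log-pitch, warped NCCF
orders = 3               # static + delta + delta-delta
feat_dim = (mfcc_dim + pitch_dim) * orders
print(feat_dim)          # -> 129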
“…Different recurrent neural network-based language models were trained and tested, as well as variants that use embedding vectors as word representations and incorporate additional lexical and morphological features [1]–[2]. On the other hand, the latest acoustic models involved purely sequence-trained deep neural networks with subsampling, specifically designed to better model longer temporal contexts [3]–[4]. These models included accent-specific vowel models, Mel-frequency cepstral coefficients (MFCCs), pitch features, and i-vectors [5] for the purpose of adaptation to different speakers and channels.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
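As a loose illustration of a recurrent language model with word-embedding inputs of the kind mentioned above (a minimal PyTorch sketch; the vocabulary size, layer sizes, and the omission of the additional lexical and morphological features from [1]–[2] are all simplifying assumptions):

import torch
import torch.nn as nn

class WordRNNLM(nn.Module):
    """Sketch of a word-level recurrent LM: embedded word IDs are fed
    through an LSTM, and each position scores the next word."""
    def __init__(self, vocab_size=50000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word embeddings
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)       # next-word scores

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) integer word indices
        out, _ = self.rnn(self.embed(word_ids))
        return self.proj(out)   # (batch, seq_len, vocab_size) logits

# Example: logits for two 10-word sequences of random word IDs.
logits = WordRNNLM()(torch.randint(0, 50000, (2, 10)))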