The motivation for this study is the poor performance of speech recognizers on stop consonants. To overcome this weakness, word-initial and word-final stop consonants are modeled at a subphonemic (microsegmental) level. Each stop consonant is segmented into a few relatively stationary microsegments: silence, voice bar, burst, and aspiration. Microsegments of certain phonemically different stops are trained together because of their similar spectral properties. The microsegmental models of burst and aspiration are conditioned on the adjacent vowel category: front versus nonfront vowels. The resulting context-dependent microsegmental hidden Markov models (HMMs) for six stops offer the desired compromise between modeling accuracy and modeling robustness. They allow the recognizer to focus discrimination on those regions of a stop that serve to distinguish it from other stops. Using these models in recognition experiments on word lists consisting of CVC words reduces the error rate by 35% compared with using one HMM per stop phoneme.
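As a reading aid, the short Python sketch below illustrates the idea of a context-dependent microsegment inventory; the particular stops, microsegment names, and sharing pattern are illustrative assumptions, not the inventory defined in the paper.

# Hypothetical sketch of context-dependent microsegmental modeling of stops.
# Microsegments with similar spectra (e.g. silence, voice bar) are shared across
# stops, while burst and aspiration models depend on the adjacent vowel class.
MICROSEGMENTS = {
    # (stop, vowel_class) -> left-to-right microsegment model sequence
    ("t", "front"):    ["silence", "burst_alveolar_front", "aspiration_front"],
    ("t", "nonfront"): ["silence", "burst_alveolar_nonfront", "aspiration_nonfront"],
    ("d", "front"):    ["silence", "voice_bar", "burst_alveolar_front"],
    ("d", "nonfront"): ["silence", "voice_bar", "burst_alveolar_nonfront"],
}

def microsegment_sequence(stop, vowel_class):
    """Return the microsegment model sequence used to build the stop's HMM."""
    return MICROSEGMENTS[(stop, vowel_class)]

print(microsegment_sequence("d", "front"))  # ['silence', 'voice_bar', 'burst_alveolar_front']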
A combination of forward and backward long short-term memory (LSTM) recurrent neural network (RNN) language models is a popular model combination approach for improving the estimation of the sequence probability in second-pass N-best list rescoring in automatic speech recognition (ASR). In this work, we extend this idea by proposing a combination of three models: a forward LSTM language model, a backward LSTM language model, and a bi-directional LSTM-based gap completion model. We derive this combination from a forward-backward decomposition of the sequence probability. We carry out experiments on the Switchboard speech recognition task. While we empirically find that this combination gives slight improvements in perplexity over the combination of forward and backward models, we ultimately show that a combination of the same number of forward models gives the best perplexity and word error rate (WER) overall.
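For readability, the two chain-rule factorizations and the gap-completion probability underlying the three models can be written as follows; the notation is chosen here for illustration and is not the paper's exact derivation.

\begin{align}
  P_{\mathrm{fwd}}(w_1^N) &= \prod_{n=1}^{N} P\bigl(w_n \mid w_1^{n-1}\bigr)
    && \text{(forward LSTM LM)} \\
  P_{\mathrm{bwd}}(w_1^N) &= \prod_{n=1}^{N} P\bigl(w_n \mid w_{n+1}^{N}\bigr)
    && \text{(backward LSTM LM)} \\
  P_{\mathrm{gap},\,n} &= P\bigl(w_n \mid w_1^{n-1},\, w_{n+1}^{N}\bigr)
    && \text{(gap completion model)}
\end{align}

A typical second-pass rescoring score would then combine these model scores log-linearly, e.g. $\log P(w_1^N) \approx \sum_i \lambda_i \log P_i(w_1^N)$ with $\sum_i \lambda_i = 1$; this combination form and the weights $\lambda_i$ are assumptions for illustration.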
In this paper, we investigate how the output representation of an end-to-end neural network affects multilingual automatic speech recognition (ASR). We study different representations, including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE) representations, and analyze their strengths and weaknesses. We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages within a single utterance but may change languages across utterances. We conduct our experiments on English and Mandarin dictation tasks, and we find that BBPE with penalty schemes can improve utterance-based bilingual ASR performance by 2% to 5% relative, even with a smaller number of outputs and fewer parameters. We conclude with an analysis that indicates directions for further improving multilingual ASR.
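As a minimal illustration (not the tokenizer used in the paper), the following Python sketch contrasts character-level and byte-level views of a hypothetical mixed English/Mandarin string, showing why a byte-level base vocabulary stays fixed at 256 symbols regardless of script.

# Character-level vs. byte-level symbols for a mixed English/Mandarin utterance.
text = "play some 周杰伦 music"

chars = list(text)                                            # character symbols: needs a large Mandarin inventory
byte_units = [f"<0x{b:02X}>" for b in text.encode("utf-8")]   # byte symbols: drawn from a fixed set of 256 values

print(len(chars), chars)            # each Chinese character is one symbol
print(len(byte_units), byte_units)  # each Chinese character becomes three UTF-8 byte symbols

BPE and BBPE would then merge frequent symbol pairs on top of these base units to form the final output vocabulary.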