A Comparison of Sequence-to-Sequence Models for Speech Recognition

Kumar

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

et al. 2019

In this paper, we describe the Maximum Uniformity of Distribution (MUD) algorithm with the power-law nonlinearity. In this approach, we hypothesize that neural network training will become more stable if the feature distribution is not overly skewed. We propose two different types of MUD approaches: power function-based MUD and histogram-based MUD.

Thanks to Samsung Electronics for funding this research. The authors are thankful to Executive Vice President Seunghwan Cho and the Speech Processing Lab members at Samsung Research.

Section: Discussion (supporting)

confidence: 67%

Section: Results (mentioning)

confidence: 79%


Power-Law Nonlinearity with Maximally Uniform Distribution Criterion for Improved Neural Network Training in Automatic Speech Recognition

Kumar

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

et al. 2019

“…Although our goal was not to design a speech processing model that can compete with those used in the domain of automatic speech recognition (Li et al., 2014; Prabhavalkar et al., 2017; Sak et al., 2017), it turns out that the notion of neural oscillations could be relevant for the latter. Hyafil and Cernak (Hyafil and Cernak, 2015) demonstrated that a biophysically plausible theta oscillator which can syllabify speech on-line in a flexible manner makes a speech recognition system more resilient to noise and to variable speech rates.…”

Section: Discussion (mentioning)

confidence: 99%

Combining predictive coding with neural oscillations optimizes on-line speech processing

Hovsepyan

Olasagasti

Giraud

2018

Preprint

Speech comprehension requires segmenting continuous speech to connect it on-line with discrete linguistic neural representations. This process relies on theta-gamma oscillation coupling, which tracks syllables and encodes them in decipherable neural activity. Speech comprehension also strongly depends on contextual cues predicting speech structure and content. To explore the effects of theta-gamma coupling on bottom-up/top-down dynamics during on-line speech perception, we designed a generative model that can recognize syllable sequences in continuous speech. The model uses theta oscillations to detect syllable onsets and align both gamma-rate encoding activity with syllable boundaries and predictions with speech input. We observed that the model performed best when theta oscillations were used to align gamma units with input syllables, i.e. when bidirectional information flows were coordinated, and internal timing knowledge was exploited. This work demonstrates that notions of predictive coding and neural oscillations can usefully be brought together to account for dynamic on-line sensory processing.

“…Recently, End-to-end (E2E) neural network architectures based on sequence to sequence (seq2seq) learning for automatic speech recognition (ASR) have been gaining lots of attention [1,2], mainly because they can learn both the acoustic and the linguistic information, as well as the alignments between them, all simultaneously unlike the conventional ASR systems which were based on the hybrid models of hidden Markov models (HMMs) and deep neural network (DNN) models. Moreover, the E2E models are more suitable to be compressed since they do not need separate phonetic dictionaries and language models, making them one of the best candidates for on-device ASR systems.…”

Section: Introduction (mentioning)

confidence: 99%

Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Jung

Lee

et al. 2019

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained a word recognition rate of around 90% for the general domain, mainly by using joint training of connectionist temporal classification (CTC) and cross entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pretraining, and data augmentation methods. In addition, we made our models more than 3.4 times smaller using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization to bring the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models and achieved a relative improvement of 36% on average in word error rate (WER) for target domains including the general domain.
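The abstract above mentions two compression steps, low-rank approximation and 8-bit quantization, without giving details. The following is a minimal illustrative sketch of the general ideas only, not the paper's iterative hyper-LRA method or its quantization scheme. The function names, the choice of symmetric per-tensor quantization, and the example dimensions are all assumptions made for illustration.

```python
def low_rank_param_ratio(rows, cols, rank):
    """Parameter-count ratio when a rows x cols weight matrix W is
    replaced by two factors A (rows x rank) and B (rank x cols).
    Generic low-rank factorization arithmetic, assumed for illustration."""
    original = rows * cols
    factorized = rank * (rows + cols)
    return original / factorized


def quantize_int8(values):
    """Symmetric per-tensor 8-bit quantization (an assumed scheme).
    Maps floats to integer codes in [-127, 127] plus one float scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from integer codes and the scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    # Hypothetical layer size and rank, not taken from the paper.
    print(low_rank_param_ratio(1024, 1024, 128))

    weights = [0.5, -1.0, 0.25, 0.0]
    codes, scale = quantize_int8(weights)
    print(codes, dequantize_int8(codes, scale))
```

In this sketch, factorizing a hypothetical 1024 x 1024 layer at rank 128 stores 128 x (1024 + 1024) parameters instead of 1024 x 1024, a fourfold reduction. Storing each weight as an 8-bit code rather than a 32-bit float would shrink storage by roughly another factor of four, at the cost of a rounding error of at most half the scale per weight.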