“…For the decoder network, we used a 2-layer LSTM with 300 cells. In addition to the standard decoder network, our proposed models require extra parameters for the gating layers that fuse the conversational-context embedding into the decoder network, compared to the baseline.

Prior Models                        | Output units            | # Params | SWBD WER (%) | CH WER (%)
LF-MMI (Povey et al., 2016)         | CD phones               | N/A      | 9.6          | 19.3
CTC (Zweig et al., 2017)            | Char                    | 53M      | 19.8         | 32.1
CTC (Sanabria and Metze, 2018)      | Char, BPE-{300,1k,10k}  | 26M      | 12.5         | 23.7
CTC (Audhkhasi et al., 2018)        | Word (Phone init.)      | N/A      | 14.6         | 23.6
Seq2Seq (Zeyer et al., 2018)        | BPE-10k                 | 150M*    | 13.5         | 27.1
Seq2Seq (Palaskar and Metze, 2018)  | Word-10k                | N/A      | 23.0         | 37.2
Seq2Seq (Zeyer et al., 2018)        | BPE-1k                  | 150M*    | 11.8         | 25.7
…”
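As a rough illustration only (not the authors' implementation), the PyTorch sketch below shows one common way a sigmoid gating layer can fuse a context embedding into a decoder hidden state: a gate computed from both inputs scales the context vector before it is combined with the decoder state. The class name `GatedContextFusion`, the layer sizes, and the fusion order are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class GatedContextFusion(nn.Module):
    """Hypothetical gating layer: fuses a conversational-context
    embedding into a decoder hidden state (a sketch, not the paper's code)."""

    def __init__(self, hidden_size=300, context_size=300):
        super().__init__()
        # Gate is conditioned on both the decoder state and the context.
        self.gate = nn.Linear(hidden_size + context_size, context_size)
        # Project the concatenated (state, gated context) back to hidden_size.
        self.proj = nn.Linear(hidden_size + context_size, hidden_size)

    def forward(self, dec_state, context_emb):
        # dec_state: (batch, hidden_size); context_emb: (batch, context_size)
        joint = torch.cat([dec_state, context_emb], dim=-1)
        g = torch.sigmoid(self.gate(joint))          # elementwise gate in [0, 1]
        fused = torch.cat([dec_state, g * context_emb], dim=-1)
        return torch.tanh(self.proj(fused))          # (batch, hidden_size)

# Usage: fuse a context vector into an LSTM decoder state of 300 cells.
fusion = GatedContextFusion(hidden_size=300, context_size=300)
h = torch.randn(8, 300)   # decoder hidden state (batch of 8)
c = torch.randn(8, 300)   # conversational-context embedding
out = fusion(h, c)        # (8, 300), passed onward in the decoder
```

Note that the gate and projection layers above are exactly the kind of extra parameters the proposed models add relative to the baseline decoder.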