2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953069
Advances in all-neural speech recognition

Abstract: This paper advances the design of CTC-based all-neural (or end-to-end) speech recognizers. We propose a novel symbol inventory, and a novel iterated-CTC method in which a second system is used to transform a noisy initial output into a cleaner version. We present a number of stabilization and initialization methods we have found useful in training these networks. We evaluate our system on the commonly used NIST 2000 conversational telephony test set, and significantly exceed the previously published performance …
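The iterated-CTC idea in the abstract — a first CTC system emits a noisy hypothesis, and a second system conditioned on that hypothesis emits a cleaned-up version — can be sketched as a two-pass pipeline. This is an illustrative sketch only; the function names and signatures are placeholders, not the paper's actual implementation.

```python
def iterated_ctc_decode(audio_features, first_pass, second_pass):
    """Two-pass decoding in the spirit of iterated CTC: the second system
    sees both the acoustic input and the first system's noisy hypothesis,
    and produces a refined transcript. Both passes are caller-supplied
    models (placeholders here)."""
    noisy_hypothesis = first_pass(audio_features)            # initial CTC output
    refined = second_pass(audio_features, noisy_hypothesis)  # cleanup pass
    return refined


# Toy stand-ins for the two systems, purely for illustration:
first = lambda feats: "helo wrld"
second = lambda feats, hyp: hyp.replace("helo", "hello").replace("wrld", "world")

print(iterated_ctc_decode(None, first, second))  # hello world
```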

Cited by 84 publications (64 citation statements) · References 20 publications
“…In place of an end of word symbol, we mark the beginning of each word with a capital letter. This idea is inspired by the unit set used for the speech recognition system in [17]. For example, the utterance 'i don't know' will be preprocessed to 'IDon'tKnow'.…”
Section: Crossword Units (mentioning)
confidence: 99%
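The crossword-unit preprocessing quoted above (marking each word start with a capital letter instead of using an end-of-word symbol) is simple to sketch. The helper name below is ours, not from the cited paper:

```python
def to_crossword_units(utterance: str) -> str:
    """Mark the beginning of each word with a capital letter and drop the
    spaces, so word boundaries are encoded in the symbol inventory itself."""
    return "".join(word[0].upper() + word[1:] for word in utterance.split())

print(to_crossword_units("i don't know"))  # IDon'tKnow
```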
“…Meanwhile, some speaker-related knowledge has been integrated into the end-to-end system [21,22]. Alternative subword units have been studied to incorporate richer information [23][24][25]. Recurrent neural network language models (RNNLMs) trained on additional text data have been integrated into the attention-based systems during training or testing [26][27][28].…”
Section: Introduction (mentioning)
confidence: 99%
“…For the decoder network, we used a 2-layer LSTM with 300 cells. In addition to the standard decoder network, our proposed models additionally require extra parameters for gating layers in order to fuse conversational-context embedding to the decoder network compared to baseline.

Prior Models | Units | Params | WER (SWB / CH)
LF-MMI (Povey et al., 2016) | CD phones | N/A | 9.6 / 19.3
CTC (Zweig et al., 2017) | Char | 53M | 19.8 / 32.1
CTC (Sanabria and Metze, 2018) | Char, BPE-{300,1k,10k} | 26M | 12.5 / 23.7
CTC (Audhkhasi et al., 2018) | Word (Phone init.) | N/A | 14.6 / 23.6
Seq2Seq (Zeyer et al., 2018) | BPE-10k | 150M* | 13.5 / 27.1
Seq2Seq (Palaskar and Metze, 2018) | Word-10k | N/A | 23.0 / 37.2
Seq2Seq (Zeyer et al., 2018) | BPE-1k | 150M* | 11.8 / 25.7
…”
Section: Training and Decoding (mentioning)
confidence: 99%