Interspeech 2017
DOI: 10.21437/interspeech.2017-233

A Comparison of Sequence-to-Sequence Models for Speech Recognition

Abstract: In this work, we conduct a detailed evaluation of various all-neural, end-to-end trained, sequence-to-sequence models applied to the task of speech recognition. Notably, each of these systems directly predicts graphemes in the written domain, without using an external pronunciation lexicon, or a separate language model. We examine several sequence-to-sequence models including connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, an attention-based model, and a model which au…
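All of the compared systems map acoustic frames directly to grapheme sequences with no external lexicon or language model. As an informal illustration of the CTC variant only (not the paper's actual architecture; the layer sizes, grapheme inventory, and use of PyTorch are assumptions here), a minimal sketch might look like:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 80-dim log-mel inputs, 30 grapheme labels plus 1 CTC blank.
NUM_MELS, NUM_GRAPHEMES, BLANK = 80, 30, 0

class CTCGraphemeModel(nn.Module):
    """Minimal encoder mapping acoustic frames to per-frame grapheme posteriors."""
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(NUM_MELS, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, NUM_GRAPHEMES + 1)  # +1 for the blank symbol

    def forward(self, feats):                        # feats: (batch, time, NUM_MELS)
        enc, _ = self.encoder(feats)
        return self.output(enc).log_softmax(dim=-1)  # (batch, time, labels)

model = CTCGraphemeModel()
ctc_loss = nn.CTCLoss(blank=BLANK)

feats = torch.randn(4, 200, NUM_MELS)               # dummy batch of 4 utterances
targets = torch.randint(1, NUM_GRAPHEMES + 1, (4, 20))
log_probs = model(feats).transpose(0, 1)             # CTCLoss expects (time, batch, labels)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((4,), 200, dtype=torch.long),
                target_lengths=torch.full((4,), 20, dtype=torch.long))
loss.backward()
```

The sketch trains the encoder with the CTC criterion alone; the RNN transducer and attention-based models evaluated in the paper replace this criterion with architectures that also condition on previously emitted labels.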

Cited by 274 publications (224 citation statements). References 15 publications.
“…In these approaches, we first obtain the Mel filterbank coefficients. The parameters estimated by the power function-based MUD using (8) are surprisingly close to the power coefficient of 1/15 which we obtained by modeling the rate-intensity curve of the human auditory system [27,28]. The histogram-based MUD shows performance comparable to conventional MFCC processing, but worse than that of the power function-based MUD.…”
Section: Discussion (supporting)
confidence: 67%
“…The performance difference between the power-law nonlinearity (·)^(1/15) and the power function-based MUD is usually very small. This was expected since the parameters estimated using (8) are not very different from 1…”
Section: Results (mentioning)
confidence: 79%
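Both excerpts above contrast a fixed power-law compression of the Mel filterbank energies with the exponent learned by the power function-based MUD. A minimal numpy sketch of the fixed (·)^(1/15) variant follows; the filterbank shapes, the floor value, and the function name are illustrative assumptions, not taken from the cited work:

```python
import numpy as np

def powerlaw_fbank_features(power_spectrum, mel_filterbank, exponent=1.0 / 15.0):
    """Apply a Mel filterbank, then compress with a power-law nonlinearity.

    power_spectrum : (frames, fft_bins) short-time power spectrum
    mel_filterbank : (fft_bins, n_mels) triangular filter weights
    exponent       : compression exponent; ~1/15 approximates the auditory
                     rate-intensity curve, replacing the log used in MFCCs
    """
    mel_energies = power_spectrum @ mel_filterbank           # (frames, n_mels)
    return np.power(np.maximum(mel_energies, 1e-10), exponent)

# Dummy example with random data standing in for a real spectrogram/filterbank.
frames, fft_bins, n_mels = 100, 257, 40
spec = np.abs(np.random.randn(frames, fft_bins)) ** 2
fbank = np.abs(np.random.randn(fft_bins, n_mels))
feats = powerlaw_fbank_features(spec, fbank)
print(feats.shape)  # (100, 40)
```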
“…Although our goal was not to design a speech processing model that can compete with those used in automatic speech recognition (Li et al., 2014; Prabhavalkar et al., 2017; Sak et al., 2017), it turns out that the notion of neural oscillations could be relevant for the latter. Hyafil and Cernak (2015) demonstrated that a biophysically plausible theta oscillator that can syllabify speech on-line in a flexible manner makes a speech recognition system more resilient to noise and to variable speech rates.…”
Section: Discussion (mentioning)
confidence: 99%
“…Recently, end-to-end (E2E) neural network architectures based on sequence-to-sequence (seq2seq) learning for automatic speech recognition (ASR) have been gaining a lot of attention [1,2], mainly because, unlike conventional hybrid systems built on hidden Markov models (HMMs) and deep neural network (DNN) models, they learn the acoustic information, the linguistic information, and the alignments between them simultaneously. Moreover, E2E models are more amenable to compression, since they do not need separate phonetic dictionaries and language models, making them strong candidates for on-device ASR systems.…”
Section: Introduction (mentioning)
confidence: 99%
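This excerpt describes E2E models learning acoustic, linguistic, and alignment information within a single network. A minimal PyTorch sketch of one such attention-based encoder-decoder follows; the layer sizes, dot-product attention, and teacher-forced loop are illustrative assumptions rather than the cited systems' actual design:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
NUM_MELS, NUM_GRAPHEMES, HIDDEN = 80, 31, 256   # grapheme set includes <sos>/<eos>

class AttentionSeq2Seq(nn.Module):
    """One network covering acoustics (encoder), alignment (attention), and an implicit LM (decoder)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(NUM_MELS, HIDDEN, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(NUM_GRAPHEMES, HIDDEN)
        self.decoder = nn.LSTMCell(HIDDEN + 2 * HIDDEN, HIDDEN)
        self.attn_query = nn.Linear(HIDDEN, 2 * HIDDEN)
        self.classifier = nn.Linear(HIDDEN + 2 * HIDDEN, NUM_GRAPHEMES)

    def forward(self, feats, prev_tokens):
        enc, _ = self.encoder(feats)                        # (B, T, 2H) acoustic encoding
        h = feats.new_zeros(feats.size(0), HIDDEN)
        c = feats.new_zeros(feats.size(0), HIDDEN)
        context = enc.new_zeros(enc.size(0), enc.size(2))
        logits = []
        for t in range(prev_tokens.size(1)):                # teacher-forced decoding
            emb = self.embed(prev_tokens[:, t])
            h, c = self.decoder(torch.cat([emb, context], dim=-1), (h, c))
            # Dot-product attention: the decoder state attends over encoder frames.
            scores = torch.bmm(enc, self.attn_query(h).unsqueeze(-1)).squeeze(-1)
            weights = scores.softmax(dim=-1)
            context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)
            logits.append(self.classifier(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)                   # (B, U, NUM_GRAPHEMES)

model = AttentionSeq2Seq()
out = model(torch.randn(2, 120, NUM_MELS), torch.randint(0, NUM_GRAPHEMES, (2, 15)))
print(out.shape)  # torch.Size([2, 15, 31])
```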