Interspeech 2017
DOI: 10.21437/interspeech.2017-546

Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Abstract: Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and a separately-trained language model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to…
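
The setup the abstract describes can be sketched in a few lines: an acoustic encoder emits per-frame log-probabilities over a word vocabulary (rather than phones or characters) and is trained with the CTC loss. Below is a minimal sketch, assuming PyTorch; the model architecture, dimensions, and the 10k-word vocabulary are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10000 + 1  # hypothetical word vocabulary plus the CTC blank

class AcousticsToWords(nn.Module):
    def __init__(self, feat_dim=40, hidden=320):
        super().__init__()
        # Bidirectional LSTM encoder over acoustic feature frames.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                               bidirectional=True, batch_first=True)
        # Per-frame distribution over words (plus blank), as CTC requires.
        self.proj = nn.Linear(2 * hidden, VOCAB_SIZE)

    def forward(self, feats):
        out, _ = self.encoder(feats)
        return self.proj(out).log_softmax(dim=-1)

model = AcousticsToWords()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(8, 500, 40)                  # (batch, frames, features)
targets = torch.randint(1, VOCAB_SIZE, (8, 20))  # word-ID reference sequences
log_probs = model(feats).transpose(0, 1)         # CTCLoss expects (T, N, C)
loss = ctc(log_probs, targets,
           torch.full((8,), 500, dtype=torch.long),  # input lengths
           torch.full((8,), 20, dtype=torch.long))   # target lengths
loss.backward()
```

Because the output layer ranges over words directly, no pronunciation dictionary or intermediate phone representation is needed at decoding time, which is the sense in which such a model is truly end-to-end.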

Cited by 102 publications (139 citation statements)
References 22 publications

“…4 we compare three sampling strategies: 1) Sample from each domain in Tab. 1 with equal probability (Uniform-Domain); 2) Further divide each domain into subdomains, and sample from each subdomain with equal probability (Uniform-Subdomain); 3) Sample from each domain with probability proportional to the total number of utterances in the domain (Count-Weighted). As can be seen in the table, for E2E ASR models, we find that, contrary to [31], the best strategy is to sample utterances proportional to the amount of training data in each domain.…”
Section: Multidomain Training: Impact of Data Diversity (mentioning)
confidence: 99%
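
The three strategies compared in the excerpt above are easy to state in code. A minimal sketch, assuming `data` maps each domain to its list of utterances and `subdata` maps subdomains likewise; none of these names come from the cited paper.

```python
import random

def uniform_domain(data):
    """Pick a domain uniformly, then pick an utterance from it."""
    domain = random.choice(list(data))
    return random.choice(data[domain])

def uniform_subdomain(subdata):
    """Same idea over finer-grained subdomain buckets."""
    sub = random.choice(list(subdata))
    return random.choice(subdata[sub])

def count_weighted(data):
    """Pick utterances with probability proportional to each domain's
    size, i.e. sample uniformly from the pooled training data."""
    pool = [u for utts in data.values() for u in utts]
    return random.choice(pool)
```

Under Count-Weighted, larger domains contribute proportionally more examples per batch; the excerpt reports that this pooled sampling worked best for the E2E ASR models studied.
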
“…In [11], it was shown that using a pre-training strategy improves the generalization capability and hence performance of models in several seq2seq problems such as machine translation and abstractive summarization. Initializing a word-based CTC model with a pre-trained phone-based CTC model was found to be useful in [12]. Similarly, multi-task learning on hierarchical models has also been found to be effective, as in [13].…”
Section: Introduction (mentioning)
confidence: 97%
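
The initialization strategy attributed to [12] amounts to copying a pre-trained phone-CTC encoder into a word-CTC model and re-initializing only the output layer, whose vocabulary size differs. A minimal sketch, assuming PyTorch; the helper, sizes, and names below are illustrative assumptions.

```python
import torch.nn as nn

def make_ctc_model(vocab_size, feat_dim=40, hidden=320):
    # Shared encoder plus a vocabulary-specific output projection.
    return nn.ModuleDict({
        "encoder": nn.LSTM(feat_dim, hidden, num_layers=4,
                           bidirectional=True, batch_first=True),
        "proj": nn.Linear(2 * hidden, vocab_size),
    })

phone_model = make_ctc_model(vocab_size=45)    # e.g. ~44 phones + blank
# ... assume phone_model has been trained with a phone-level CTC loss ...

word_model = make_ctc_model(vocab_size=10001)  # word vocabulary + blank
# Copy the shared encoder weights; word_model["proj"] stays randomly
# initialized because its output dimension differs from the phone layer's.
word_model["encoder"].load_state_dict(phone_model["encoder"].state_dict())
```
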
“…Prior studies on speaker adaptation of E2E systems include appending i-vectors to the acoustic features [8], using speaker-transformed features obtained by feature-space maximum likelihood linear regression (fMLLR) [9], using GMM-derived features [10], or using a speaker adversarial network [11]. Most of these methods apply adaptation only to the input features.…”
Section: Introduction (mentioning)
confidence: 99%
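
The first method the excerpt lists, appending i-vectors to the acoustic features [8], amounts to concatenating one per-speaker embedding onto every frame before the encoder. A minimal sketch, assuming PyTorch; all dimensions are illustrative.

```python
import torch

def append_ivector(feats, ivector):
    """feats: (batch, frames, feat_dim); ivector: (batch, ivec_dim).
    Returns (batch, frames, feat_dim + ivec_dim)."""
    tiled = ivector.unsqueeze(1).expand(-1, feats.size(1), -1)
    return torch.cat([feats, tiled], dim=-1)

feats = torch.randn(8, 500, 40)  # e.g. log-mel frames
ivecs = torch.randn(8, 100)      # one 100-dim i-vector per speaker
adapted = append_ivector(feats, ivecs)  # -> (8, 500, 140)
```

Only the encoder's input dimension grows; the rest of the network is unchanged, which is why the excerpt notes that most of these methods apply adaptation only to the input features.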