Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Audhkhasi, Kartik; Ramabhadran, Bhuvana; Saon, George; Picheny, Michael; Nahamoo, D.

doi:10.21437/interspeech.2017-546

Cited by 102 publications

(139 citation statements)

References 22 publications

Supporting

Mentioning

133

Contrasting

“…4 we compare three sampling strategies: 1) Sample from each domain in Tab. 1 with equal probability (Uniform-Domain); 2) Further divide each domain into subdomains, and sample from each subdomain with equal probability (Uniform-Subdomain); 3) Sample from each domain with probability proportional to the total number of utterances in the domain (Count-Weighted). As can be seen in the table, for E2E ASR models, we find that, contrary to [31], the best strategy is to sample utterances proportional to the amount of training data in each domain.…”

Section: Multidomain Training: Impact Of Data Diversity (mentioning)

confidence: 99%
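The three sampling strategies named in the quoted statement can be illustrated with a small sketch. The corpus below, its domain and subdomain names, and the utterance counts are all hypothetical; only the three strategy definitions come from the quote, and this is not the citing authors' implementation.

```python
import random

# Hypothetical toy corpus: domain -> subdomain -> number of utterances.
CORPUS = {
    "search": {"short": 600, "long": 200},
    "telephony": {"calls": 150},
    "video": {"news": 30, "vlogs": 20},
}


def domain_weights(corpus, strategy):
    """Return the probability of drawing from each domain under a strategy."""
    if strategy == "uniform_domain":
        # Every domain is equally likely.
        return {d: 1.0 / len(corpus) for d in corpus}
    if strategy == "uniform_subdomain":
        # Every subdomain is equally likely, so a domain's share is
        # proportional to how many subdomains it has.
        total_subs = sum(len(subs) for subs in corpus.values())
        return {d: len(subs) / total_subs for d, subs in corpus.items()}
    if strategy == "count_weighted":
        # A domain's share is proportional to its total utterance count.
        total = sum(sum(subs.values()) for subs in corpus.values())
        return {d: sum(subs.values()) / total for d, subs in corpus.items()}
    raise ValueError(f"unknown strategy: {strategy}")


def sample_domains(corpus, strategy, n, seed=None):
    """Draw n domain names according to the chosen strategy."""
    weights = domain_weights(corpus, strategy)
    rng = random.Random(seed)
    return rng.choices(list(weights), weights=list(weights.values()), k=n)


if __name__ == "__main__":
    for name in ("uniform_domain", "uniform_subdomain", "count_weighted"):
        print(name, domain_weights(CORPUS, name))
```

With these made-up counts, Count-Weighted puts most of the probability mass on the largest domain, while the two uniform strategies over-sample the small ones relative to their size.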

Recognizing Long-Form Speech Using Streaming End-to-End Models

Narayanan, Prabhavalkar, Chiu, et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

109

All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances. On a synthesized long-form test set, adding data diversity improves word error rate (WER) by 90% relative, while simulating long-form training improves it by 67% relative, though the combination doesn't improve over data diversity alone. On a real long-form call-center test set, adding data diversity improves WER by 40% relative. Simulating long-form training on top of data diversity improves performance by an additional 27% relative.
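For readers unfamiliar with "relative" WER improvements, the arithmetic can be sketched as follows. The absolute WER values used here are hypothetical (the abstract reports only relative figures), and the stacking calculation assumes that the additional 27% is measured relative to the data-diversity model, which the abstract suggests but does not state outright.

```python
def relative_improvement(baseline_wer, new_wer):
    """Fraction by which WER dropped, relative to the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


def stack_relative(*improvements):
    """Combine successive relative improvements into one overall figure.

    Each improvement is assumed to be measured against the result of the
    previous step, so the remaining error fractions multiply.
    """
    remaining = 1.0
    for imp in improvements:
        remaining *= 1.0 - imp
    return 1.0 - remaining


if __name__ == "__main__":
    # Hypothetical WERs: a drop from 50.0% to 5.0% is a 90% relative gain.
    print(relative_improvement(50.0, 5.0))
    # 40% relative, then a further 27% relative on top of that result.
    print(stack_relative(0.40, 0.27))
```

Under that assumption, the two call-center improvements together would amount to roughly a 56% relative reduction against the original baseline.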

“…In [11], it was shown that using a pre-training strategy improves the generalization capability and hence performance of models in several seq2seq problems such as machine translation and abstractive summarization. Initializing a word-based CTC model with a pre-trained phone-based CTC model was found to be useful in [12]. Similarly, multi-task learning on hierarchical models has also been found to be effective, as in [13].…”

Section: Introduction (mentioning)

confidence: 97%

Improved Multi-Stage Training of Online Attention-Based Encoder-Decoder Models

Garg, Gowda, Kumar, et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

In this paper, we propose a refined multi-stage multi-task training strategy to improve the performance of online attention-based encoder-decoder (AED) models. A three-stage training based on three levels of architectural granularity, namely character encoder, byte pair encoding (BPE) based encoder, and attention decoder, is proposed. Also, multi-task learning based on two levels of linguistic granularity, namely character and BPE, is used. We explore different pre-training strategies for the encoders including transfer learning from a bidirectional encoder. Our encoder-decoder models with online attention show ∼35% and ∼10% relative improvement over their baselines for smaller and bigger models, respectively. Our models achieve a word error rate (WER) of 5.04% and 4.48% on the Librispeech test-clean data for the smaller and bigger models, respectively, after fusion with a long short-term memory (LSTM) based external language model (LM).

“…Prior studies on speaker adaptation of E2E systems include appending i-vectors to the acoustic features [8], using speaker-transformed features obtained by feature space maximum likelihood linear regression (fMLLR) [9], using GMM-derived features [10], or using a speaker adversarial network [11]. Most of these methods apply adaptation only to the input features.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory for End-to-End ASR

Sarı, Moritz, Hori, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joint connectionist temporal classification and attention-based encoder-decoder architecture. M-vector and i-vector results are compared for inserting them at different layers of the encoder neural network using the WSJ and TED-LIUM2 ASR benchmarks. We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
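The memory read described in this abstract can be pictured with a minimal sketch. It assumes plain dot-product scoring followed by a softmax, and a query vector supplied by the caller; the abstract does not specify the scoring function or how the query is formed, so both are illustrative choices, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, memory):
    """Attend over stored i-vectors and return (m_vector, weights).

    query  : list of floats (assumed to have the i-vector dimension)
    memory : list of i-vectors, each a list of floats
    """
    # Dot-product score between the query and every stored i-vector.
    scores = [sum(q * v for q, v in zip(query, ivec)) for ivec in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    # The M-vector is the attention-weighted sum of the stored i-vectors.
    m_vector = [
        sum(w * ivec[i] for w, ivec in zip(weights, memory)) for i in range(dim)
    ]
    return m_vector, weights


def append_to_frames(frames, m_vector):
    """Concatenate the M-vector to every acoustic feature frame."""
    return [list(frame) + list(m_vector) for frame in frames]


if __name__ == "__main__":
    memory = [[1.0, 0.0], [0.0, 1.0]]  # two toy "i-vectors"
    m_vec, attn = read_memory([2.0, 0.0], memory)
    print(attn, m_vec)
    print(append_to_frames([[0.1, 0.2]], m_vec))
```

Because the read is a weighted sum over vectors already stored in the model, no separate speaker-embedding extractor is needed at test time, which is the property the abstract highlights.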
