ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683380
Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition

Abstract: In this paper, we explore several new schemes to train a seq2seq model to integrate a pre-trained LM. Our proposed fusion methods focus on the memory cell state and the hidden state in the seq2seq decoder long short-term memory (LSTM); unlike prior studies, the memory cell state is updated by the LM. This means that the memory retained by the main seq2seq model can be adjusted by the external LM. These fusion methods have several variants depending on the architecture of this memory cell update and the use of me…
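The cell-state update described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the parameter names (`W_g`, `W_c`) and the additive gated update are assumptions for the sketch, which only shows the general idea of letting an external LM state modify the decoder LSTM's memory cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 4  # toy hidden size

# Hypothetical fusion parameters (illustrative names, not the paper's
# notation): W_g produces a gate from decoder and LM states, W_c
# projects the LM state into the cell-state space.
W_g = rng.normal(size=(d, 2 * d))
W_c = rng.normal(size=(d, d))

def fuse_cell_state(c_dec, h_dec, h_lm):
    """Adjust the decoder LSTM cell state c_dec using the external LM
    hidden state h_lm via a learned gate (a cell-control-style fusion
    sketch, assumed form)."""
    g = sigmoid(W_g @ np.concatenate([h_dec, h_lm]))  # fusion gate in (0, 1)
    return c_dec + g * (W_c @ h_lm)                   # gated cell update

c = np.zeros(d)
h_dec = rng.normal(size=d)
h_lm = rng.normal(size=d)
c_new = fuse_cell_state(c, h_dec, h_lm)
print(c_new.shape)
```

In a real decoder this update would be applied at every decoding step, so the LM's influence accumulates through the recurrent cell state rather than only reweighting output probabilities.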

Cited by 5 publications (6 citation statements)
References 30 publications
“…For all augmentation schemes, we sweep across the dropout rate in the range [0.0, 0.7] to find the optimal level of total regularization. For the baseline, the best result comes from setting it to 0.7; models trained with data augmentation were fairly robust to the dropout rate and achieved their best performance in the range 0.3-0.6.…”
Section: Tested Augmentation Schemes
confidence: 99%
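The sweep described in this excerpt amounts to training at each candidate dropout rate and keeping the one with the best validation score. A minimal sketch, where `train_and_eval` is a hypothetical stand-in for a full training run returning a validation error:

```python
# Hypothetical stand-in for training a model at a given dropout rate
# and returning its validation word error rate. The quadratic toy
# objective below is illustrative only; it places the optimum at 0.7,
# mirroring the baseline result quoted above.
def train_and_eval(dropout):
    return (dropout - 0.7) ** 2 + 10.0

rates = [round(0.1 * i, 1) for i in range(8)]  # 0.0, 0.1, ..., 0.7
best_rate = min(rates, key=train_and_eval)
print(best_rate)  # 0.7 under this toy objective
```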
“…The traditional reason language models (LMs) appear in ASR systems is that they directly represent the prior term P(S) in the Bayes factorization of the posterior probability P(S|A) of a sentence S given the audio A. However, in practice, LMs trained on large amounts of external data are combined with hybrid and end-to-end systems alike [1,2,3] at the authors' liberty. Overall, LMs can be seen as a refinement tool applied to a preliminary recognition result.…”
Section: Introduction
confidence: 99%
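The most common form of such LM refinement is shallow fusion: a log-linear interpolation of the seq2seq score with the LM prior. A minimal sketch (the interpolation weight and the toy hypothesis probabilities are illustrative, not from the paper):

```python
import math

def shallow_fusion_score(logp_s2s, logp_lm, lam=0.3):
    """Combine seq2seq and LM log-probabilities log-linearly:
    score = log P_s2s(S|A) + lam * log P_lm(S).
    `lam` is a tunable interpolation weight (value here is arbitrary)."""
    return logp_s2s + lam * logp_lm

# Toy rescoring of two hypotheses: the LM prior breaks a near-tie in
# the acoustic/seq2seq scores in favor of the more plausible sentence.
hyps = {
    "recognize speech": (math.log(0.40), math.log(0.30)),
    "wreck a nice beach": (math.log(0.42), math.log(0.05)),
}
best = max(hyps, key=lambda h: shallow_fusion_score(*hyps[h]))
print(best)  # recognize speech
```

Shallow fusion only reweights output scores at decoding time; the fusion methods of this paper instead let the LM act on the decoder's internal states during training.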
“…Cho et al. [9] present a technique (Cell Control Fusion) that is similar to [8], but differs in that it fuses not only the gated hidden states of the external language model (RNN) but also the cell states. Hence, an LSTM variant of the RNN is used as the sequence-to-sequence model in this technique.…”
Section: Sriram et al. [8] Presents a Technique (Cold Fusion)
confidence: 99%
“…Unimodal and multimodal model fusion has been explored extensively in the context of ASR [29,7], Neural Machine Translation (NMT) [12], and hierarchical story generation [11]. However, to the best of our knowledge, there has been no similar work for visual captioning.…”
Section: Fusion Techniques and Variations
confidence: 99%