ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053568
DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition

Abstract: Self-attention networks (SAN) have been introduced into automatic speech recognition (ASR) and achieved state-of-the-art performance owing to their superior ability to capture long-term dependencies. One of the key ingredients is the self-attention mechanism, which can be performed effectively over the whole utterance. In this paper, we investigate whether information beyond the whole-utterance level can be exploited and be beneficial. We propose to apply a self-attention layer with augmented memory…
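The memory-augmented self-attention the abstract describes can be sketched as follows. This is a minimal single-head NumPy sketch, not the paper's exact DFSMN-SAN implementation: it assumes the general persistent-memory idea in which a small set of trainable key/value vectors (named `Mk`, `Mv` here; the names and dimensions are hypothetical) is prepended to the utterance's own keys and values, so every query frame can also attend to utterance-independent memory slots.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_with_memory(X, Wq, Wk, Wv, Mk, Mv):
    """Single-head self-attention over a whole utterance X (T x d),
    with persistent memory key/value vectors Mk, Mv (m x d) prepended
    to the per-utterance keys and values."""
    Q = X @ Wq                                  # (T, d)
    K = np.concatenate([Mk, X @ Wk], axis=0)    # (m + T, d)
    V = np.concatenate([Mv, X @ Wv], axis=0)    # (m + T, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (T, m + T)
    return softmax(scores) @ V                  # (T, d)

# Toy dimensions: T = 5 frames, d = 8 features, m = 4 memory slots.
rng = np.random.default_rng(0)
T, d, m = 5, 8, 4
X = rng.standard_normal((T, d))
Wq, Wk, Wv = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
Mk, Mv = rng.standard_normal((m, d)), rng.standard_normal((m, d))
out = self_attention_with_memory(X, Wq, Wk, Wv, Mk, Mv)
print(out.shape)  # (5, 8): one attended vector per input frame
```

In training, `Mk` and `Mv` would be learned parameters shared across utterances, which is what lets them store information beyond any single utterance.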

Cited by 14 publications (13 citation statements)
References 15 publications
“…The acoustic model is set up according to the paper [16], which contains both the Chinese syllable and the English phone as modeling units. We also apply the multi-graph decoding strategy [17], which contains code-switch and English 1 and the code-switch dataset as Table 2, and the English is trained by the transcript of librispeech.…”
Section: Setup
confidence: 99%
“…First is the AISHELL-2 [18] dataset, which contains 1000 hours of speech data from 1991 speakers. Second is a 10K-hour multi-domain dataset [10]. We also augment the AISHELL-1 and AISHELL-2 training data with 2-fold speed perturbation [19] in the experiments.…”
Section: Datasets
confidence: 99%
“…Recently, the transformer architecture, which has achieved success in natural language processing (NLP) tasks, has also been widely used in ASR systems [8,9], demonstrating superior performance compared with state-of-the-art models. Our previous work also proposed a variant architecture that combined DFSMN with self-attention networks (SAN), and further applied the memory-augmenting method on the self-attention layer [10]. In summary, the performance improvement of ASR systems owes much to dedicated hand-designed model architectures.…”
Section: Introduction
confidence: 99%
“…64 i-vectors were tested to be a good choice to provide diverse speaker information (Zhao et al., 2020), and applying them on all encoder layers helps capture speaker knowledge from both low-level phonetic features and high-level global information. Furthermore, here we also compare our model with the first persistent memory model used in ASR (You et al., 2019), in which persistent memory vectors are randomly initialized and meant to capture general knowledge. Different from theirs, our model addresses the speaker mismatch issue.…”
Section: Adaptation for General Speakers
confidence: 99%