ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053324
Sequence-to-Sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding

Abstract: In this paper, we investigate the benefit that off-the-shelf word embedding can bring to sequence-to-sequence (seq-to-seq) automatic speech recognition (ASR). We first introduce word embedding regularization, which maximizes the cosine similarity between a transformed decoder feature and the target word embedding. Based on the regularized decoder, we further propose the fused decoding mechanism. This allows the decoder to consider the semantic consistency during decoding by absorbing the information carr…
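As a rough illustration of the regularization described in the abstract, the sketch below adds a cosine-similarity term between a projected decoder state and a frozen off-the-shelf word embedding. This is a minimal PyTorch sketch under assumed shapes; the projection layer, embedding source, and the loss weight `lambda_reg` are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingRegularizedDecoderHead(nn.Module):
    """Decoder output head with word-embedding regularization (assumed setup)."""

    def __init__(self, decoder_dim: int, embed_dim: int, vocab_size: int,
                 pretrained_embeddings: torch.Tensor):
        super().__init__()
        # Linear map from the decoder hidden state into the word-embedding space.
        self.to_embed_space = nn.Linear(decoder_dim, embed_dim)
        # Frozen off-the-shelf word embeddings (e.g. vectors trained on text only).
        self.word_embeddings = nn.Embedding.from_pretrained(
            pretrained_embeddings, freeze=True)
        # Standard output projection for the usual cross-entropy objective.
        self.output_proj = nn.Linear(decoder_dim, vocab_size)

    def forward(self, decoder_states, target_ids, lambda_reg: float = 0.1):
        # decoder_states: (batch, time, decoder_dim); target_ids: (batch, time).
        # Padding handling is omitted for brevity.
        logits = self.output_proj(decoder_states)
        ce_loss = F.cross_entropy(logits.transpose(1, 2), target_ids)

        # Regularizer: push the transformed decoder feature toward the target
        # word embedding by maximizing their cosine similarity.
        projected = self.to_embed_space(decoder_states)
        targets = self.word_embeddings(target_ids)
        cos_sim = F.cosine_similarity(projected, targets, dim=-1)
        reg_loss = (1.0 - cos_sim).mean()

        return ce_loss + lambda_reg * reg_loss
```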

Cited by 9 publications (2 citation statements)
References 20 publications
“…Other recent studies have sought to improve E2E ASR with word embedding learned from text-only data. In [43], the researchers chose to adopt word embedding because off-the-shelf word embedding carrying semantic information learned from a vast amount of text can be easily obtained. An autoregressive decoder was generally used to predict the transcription corresponding to the input speech.…”
Section: Vocabulary
confidence: 99%
“…The ASR sub-model is based on a hybrid connectionist temporal classification (CTC)/attention architecture [31] and is inspired by prior work including that in [31], [34], and [35]. To train the ASR task, we used the Librispeech dataset [36], which is an English dataset comprising over 1000 hours of read speech.…”
Section: Transfer Learning and Comparison to Prior Work
confidence: 99%
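For context, the hybrid CTC/attention objective referenced above [31] interpolates a CTC loss over encoder-side outputs with a cross-entropy loss on the attention decoder. The sketch below is a minimal PyTorch illustration under assumed tensor shapes and an assumed interpolation weight `ctc_weight`; it is not the cited work's exact training setup.

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, attn_logits, targets,
                              input_lengths, target_lengths,
                              ctc_weight: float = 0.3):
    """Interpolated hybrid CTC/attention loss (assumed formulation).

    Assumes label ids start at 1, with index 0 reserved for both the
    CTC blank symbol and target padding.
    """
    # ctc_log_probs: (time, batch, vocab) log-probabilities from the encoder side.
    # targets: (batch, max_target_len) padded label ids.
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # attn_logits: (batch, max_target_len, vocab) from the autoregressive decoder.
    att = F.cross_entropy(attn_logits.transpose(1, 2), targets, ignore_index=0)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att
```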