Interspeech 2020
DOI: 10.21437/interspeech.2020-1778
Semantic Mask for Transformer Based End-to-End Speech Recognition

Cited by 28 publications (39 citation statements)
References 0 publications
“…The design of the Frame-based Masked Language Model task is inspired by the Masked Language Model (MLM) objective of BERT (Devlin et al, 2019) and semantic mask for ASR task (Wang et al, 2019a). This task enables the encoder to understand the inner meaning of a segment of speech.…”
Section: Frame-based Masked Language Model
confidence: 99%
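The semantic-mask idea cited above masks whole time spans of the input speech features (e.g. spans aligned to words) rather than isolated random frames, forcing the model to rely on context. A minimal NumPy sketch of span masking — the function name and the mean-fill choice are my own illustration, not the paper's exact recipe:

```python
import numpy as np

def apply_semantic_mask(features, spans):
    """Mask whole time spans of a (frames x dims) speech feature matrix.

    spans -- list of (start, end) frame intervals, e.g. word-aligned
    segments; masked frames are filled with the utterance mean, one
    common fill choice for this style of augmentation.
    """
    out = features.copy()
    fill = features.mean(axis=0)
    for start, end in spans:
        out[start:end] = fill
    return out

# toy example: 10 frames, 4 feature dims; mask frames 2..4
feats = np.arange(40, dtype=float).reshape(10, 4)
masked = apply_semantic_mask(feats, [(2, 5)])
```

Because an entire span is replaced at once, the acoustic evidence for the covered word is removed, which is what pushes the decoder toward language-model-like behavior.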
“…As with the utterance-wise evaluation, the Early Exit Transformer model… Table 1: Utterance-wise evaluation. Two numbers in a cell denote %WER of the hybrid SR model used in LibriCSS [18] and end-to-end transformer based SR model [16]. 0S: 0% overlap with short inter-utterance silence.…”
Section: Evaluation Results
confidence: 99%
“…One is a hybrid system with a BLSTM based acoustic model and a 4-gram language model as used in the original LibriCSS paper [18]. The other is one of the best open-source end-to-end transformer [16] based ASR models, which achieves WERs of 2.08% and 4.95% for LibriSpeech test-clean and test-other, respectively. As with [18], by leveraging the multiple microphones, the individual target signals are generated with mask-based adaptive minimum variance distortionless response (MVDR) beamforming.…”
Section: Implementation Details
confidence: 99%
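The MVDR beamformer mentioned in the quote computes, per frequency bin, weights w = R_n⁻¹ d / (dᴴ R_n⁻¹ d), where R_n is the noise covariance (estimated from a time-frequency mask) and d the steering vector toward the target speaker. A toy single-frequency sketch — names are mine, not from the systems in [16] or [18]:

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR beamformer: w = R_n^{-1} d / (d^H R_n^{-1} d).

    Minimizes output noise power subject to the distortionless
    constraint w^H d = 1 in the target direction.
    """
    rn_inv_d = np.linalg.solve(noise_cov, steering)
    return rn_inv_d / (steering.conj() @ rn_inv_d)

# toy 3-microphone example: identity noise covariance,
# unit-norm steering vector toward the target speaker
d = np.ones(3) / np.sqrt(3)
w = mvdr_weights(np.eye(3), d)
```

With an identity noise covariance the weights reduce to the steering vector itself; in practice R_n is estimated per frequency from the masked noisy frames.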
“…The proposed approach, referred to as Top-Down SLU (TD-SLU), is based on top-down attention, a technique inspired by Cognitive Science [8,9] and already successfully applied in the computer vision domain [10,11]. More specifically, on top of the traditional transformer sequence-to-sequence (s2s) model [12] used in recent E2E SLU models [13,14], we introduce an additional transformer encoder for the NLU task, which allows the model to use the entire sequence to gain semantic understanding through attention. In addition, following the multimodal literature, we combine the top-level NLU feature with the low-level acoustic features using an attention gate [15], which generates a shift in the low-level representation, adapting it according to the high-level information and thus improving performance.…”
Section: Introduction
confidence: 99%
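The attention-gate fusion described in the quote shifts the low-level acoustic representation conditioned on the high-level NLU feature. A rough sketch of one plausible form — a sigmoid gate scaling a learned shift added to every frame; the weights, names, and exact parameterization here are my own illustration, not TD-SLU's actual design:

```python
import numpy as np

def attention_gate(acoustic, nlu_feat, w_gate, w_shift):
    """Add an NLU-conditioned shift to every acoustic frame.

    acoustic -- (frames, dim) low-level features
    nlu_feat -- (nlu_dim,) top-level NLU vector
    w_gate, w_shift -- (dim, nlu_dim) projection matrices
    """
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ nlu_feat)))  # sigmoid, (dim,)
    shift = w_shift @ nlu_feat                          # (dim,)
    return acoustic + gate * shift                      # broadcast over frames

rng = np.random.default_rng(0)
acoustic = rng.standard_normal((5, 8))
nlu = rng.standard_normal(16)
out = attention_gate(acoustic, nlu,
                     rng.standard_normal((8, 16)),
                     rng.standard_normal((8, 16)))
```

Because the shift is additive and gated, a zero shift projection leaves the acoustic features untouched, so the fusion can fall back to the purely acoustic pathway.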