ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054260

Synchronous Transformers for end-to-end Speech Recognition

Abstract: In most attention-based sequence-to-sequence models, the decoder predicts the output sequence conditioned on the entire input sequence processed by the encoder. This asynchrony between encoding and decoding makes such models difficult to apply to online speech recognition. In this paper, we propose a model named the synchronous transformer to address this problem, which can predict the output sequence chunk by chunk. Once a fixed-length chunk of the input sequence is processed by the encoder…
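
The abstract describes decoding that proceeds chunk by chunk as the encoder consumes fixed-length pieces of the input. Below is a minimal sketch of that idea assuming a generic encoder/decoder interface; `encoder`, `decoder_step`, the chunk size, and the end-of-chunk symbol are hypothetical names used for illustration, not the paper's actual implementation.

```python
import torch

CHUNK_SIZE = 64   # frames per chunk (assumed value)
EOC_ID = 1        # "end of chunk" symbol id (assumed)

def synchronous_decode(frames, encoder, decoder_step, sos_id=0, max_tokens_per_chunk=10):
    """Chunk-by-chunk greedy decoding sketch.

    frames: (T, feat_dim) tensor of acoustic features.
    encoder: callable mapping a chunk of frames to encoder states.
    decoder_step: callable mapping (encoder memory, partial hypothesis) to the next token id.
    """
    hyp = [sos_id]
    enc_states = []                                  # encoder memory grows one chunk at a time
    for start in range(0, frames.size(0), CHUNK_SIZE):
        chunk = frames[start:start + CHUNK_SIZE]     # only this fixed-length chunk is encoded now
        enc_states.append(encoder(chunk))
        memory = torch.cat(enc_states, dim=0)        # states produced so far
        for _ in range(max_tokens_per_chunk):        # emit the tokens belonging to this chunk
            token = decoder_step(memory, hyp)
            if token == EOC_ID:                      # decoder signals it is done with this chunk
                break
            hyp.append(token)
    return hyp[1:]
```

Because decoding is interleaved with encoding, the decoder never waits for the full utterance, which is what makes the chunk-wise formulation suitable for online recognition.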

Citations: cited by 60 publications (35 citation statements)
References: 18 publications
“…Then, we use a convolution front end to down-sample the long acoustic features. In the convolution front end, following Dong et al (2018); Tian et al (2020), two 3×3 CNN layers with stride 2 are stacked for both time and frequency dimensions. Afterwards, in order to enable the acoustic encoder to attend by relative positions, the positional encoding is added to the output of the convolution front end.…”
Section: Acoustic Encoder
confidence: 99%
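
The quoted passage describes a convolution front end with two 3×3 convolutions of stride 2 followed by positional encoding. A rough PyTorch sketch is given below; the channel count, model dimension, input feature size, and the use of sinusoidal (rather than relative) positional encoding are assumptions, not details taken from the cited papers.

```python
import math
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Two 3x3 stride-2 convolutions down-sample both time and frequency by 4x,
    as in the quoted description; in_freq and d_model are assumed values."""

    def __init__(self, in_freq=80, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(d_model * ((in_freq + 3) // 4), d_model)

    def forward(self, feats):                       # feats: (batch, time, freq)
        x = self.conv(feats.unsqueeze(1))           # (batch, d_model, time/4, freq/4)
        b, c, t, f = x.size()
        x = x.transpose(1, 2).reshape(b, t, c * f)  # flatten channels x freq per frame
        x = self.proj(x)
        # sinusoidal positional encoding added to the down-sampled sequence (assumption)
        pos = torch.arange(t, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, x.size(-1), 2, dtype=torch.float32)
                        * (-math.log(10000.0) / x.size(-1)))
        pe = torch.zeros(t, x.size(-1))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return x + pe
```

The 4x reduction in the time dimension is what makes the subsequent self-attention over long acoustic sequences affordable.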
“…In this work, we make the following efforts to advance multimodal NER: First, we construct a large-scale humanannotated Chinese NER dataset with Textual and Acoustic contents, named CNERTA. Specifically, we annotate all occurrences of 3 entity types (person name, location and organization) in 42,987 sentences originating from the transcripts of Aishell-1 (Bu et al, 2017), a corpus that has been widely employed in Mandarin speech recognition research in recent years (Shan et al, 2019;Tian et al, 2020). In particular, unlike previous multimodal NER datasets (Moon et al, 2018;Lu et al, 2018) are all flatly annotated, not only the topmost entities but also nested entities are annotated in CNERTA.…”
Section: Introduction
confidence: 99%
“…As the reception field grows linearly with the number of transformer layers, a large latency is introduced with the strategy. 2) chunk-wise method [27,15] segments the input into small chunks and operates speech recognition on each chunk. However, the accuracy drops as the relationship between different chunks are ignored.…”
Section: Introduction
confidence: 99%
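
For concreteness, the chunk-wise strategy mentioned in the quote amounts to cutting the feature sequence into fixed-length pieces that are recognized independently. The snippet below sketches that segmentation; the chunk length and input sizes are assumed values.

```python
import torch

def split_into_chunks(feats: torch.Tensor, chunk_len: int = 40):
    """feats: (time, feat_dim) -> list of (<=chunk_len, feat_dim) chunks."""
    return [feats[i:i + chunk_len] for i in range(0, feats.size(0), chunk_len)]

feats = torch.randn(200, 80)        # 200 frames of 80-dim features (assumed shape)
chunks = split_into_chunks(feats)   # 5 chunks of 40 frames each
# Each chunk is then encoded and decoded on its own; frames in one chunk cannot
# see frames in another, which is the accuracy/latency trade-off the passage notes.
```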
“…The time and space complexity are both reduced to O(T ), and the within-chunk computation across time can be parallelized with GPUs. While there has been recent work [18,19,20,21,22] with similar ideas showing that such streaming Transformers achieve competitive performance compared with latency-controlled BiLSTMs [23] or non-streaming Transformers for ASR, it remains unclear how the streaming transformers work for shorter sequence modeling task like wake word detection.…”
Section: Introduction
confidence: 99%
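
The O(T) claim in this quote follows from restricting self-attention to frames within the same chunk: for a fixed chunk size, the number of allowed attention pairs grows linearly with the sequence length. The sketch below builds such a block-diagonal attention mask; the chunk size and the mask convention (True = may attend) are assumptions.

```python
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int = 16) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True marks positions a query may attend to."""
    chunk_ids = torch.arange(seq_len) // chunk_size           # chunk index of each frame
    return chunk_ids.unsqueeze(0) == chunk_ids.unsqueeze(1)   # block-diagonal structure

mask = chunk_attention_mask(seq_len=64, chunk_size=16)
# Allowed pairs = seq_len * chunk_size, i.e. linear in T, versus seq_len**2
# for full self-attention; within each chunk the attention is still dense,
# so it can be computed in parallel on a GPU.
print(mask.sum().item())   # 64 * 16 = 1024 allowed query-key pairs
```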