Synchronous Transformers for end-to-end Speech Recognition

Tian, Zhengkun; Yi, Jiangyan; Bai, Ye; Tao, Jianhua; Zhang, Shuai; Wen, Zhengqi

doi:10.1109/icassp40776.2020.9054260

Cited by 60 publications

(35 citation statements)

References 18 publications

“…Then, we use a convolution front end to down-sample the long acoustic features. In the convolution front end, following Dong et al (2018); Tian et al (2020), two 3×3 CNN layers with stride 2 are stacked for both time and frequency dimensions. Afterwards, in order to enable the acoustic encoder to attend by relative positions, the positional encoding is added to the output of the convolution front end.…”

Section: Acoustic Encoder (mentioning)

confidence: 99%
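The down-sampling described in the statement above can be illustrated with a small shape calculation. This is a minimal sketch, not the cited authors' code: it uses the standard convolution output-length formula, and the padding value, the 1000-frame input, and the 80-dimensional features are assumptions chosen for illustration (the statement only specifies two 3×3 layers with stride 2).

```python
def conv_out_len(n: int, kernel: int = 3, stride: int = 2, padding: int = 0) -> int:
    """Output length of one convolution along a single axis."""
    return (n + 2 * padding - kernel) // stride + 1


def front_end_shape(frames: int, feat_dim: int, layers: int = 2, padding: int = 0):
    """(time, frequency) size after stacking `layers` 3x3, stride-2 convolutions.

    padding=0 is an assumption; the quoted text does not state the padding.
    """
    for _ in range(layers):
        frames = conv_out_len(frames, padding=padding)
        feat_dim = conv_out_len(feat_dim, padding=padding)
    return frames, feat_dim


# Hypothetical input: 1000 acoustic frames of 80-dimensional features.
print(front_end_shape(1000, 80))  # (249, 19)
```

Under these assumptions both the time and frequency axes shrink by roughly a factor of four, which is what makes the subsequent self-attention over long acoustic inputs cheaper.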

“…In this work, we make the following efforts to advance multimodal NER: First, we construct a large-scale human-annotated Chinese NER dataset with Textual and Acoustic contents, named CNERTA. Specifically, we annotate all occurrences of 3 entity types (person name, location and organization) in 42,987 sentences originating from the transcripts of Aishell-1 (Bu et al, 2017), a corpus that has been widely employed in Mandarin speech recognition research in recent years (Shan et al, 2019; Tian et al, 2020). In particular, unlike previous multimodal NER datasets (Moon et al, 2018; Lu et al, 2018) are all flatly annotated, not only the topmost entities but also nested entities are annotated in CNERTA.…”

Section: Introduction (mentioning)

confidence: 99%

A Large-Scale Chinese Multimodal NER Dataset with Speech Clues

Sui, Tian, Chen et al. 2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

Self Cite

In this paper, we aim to explore an uncharted territory, which is Chinese multimodal named entity recognition (NER) with both textual and acoustic contents. To achieve this, we construct a large-scale human-annotated Chinese multimodal NER dataset, named CNERTA. Our corpus contains a total of 42,987 annotated sentences accompanied by 71 hours of speech data. Based on this dataset, we propose a family of strong and representative baseline models, which can leverage textual features or multimodal features. Upon these baselines, to capture the natural monotonic alignment between the textual modality and the acoustic modality, we further propose a simple multimodal multitask model by introducing a speech-to-text alignment auxiliary task. Through extensive experiments, we observe that: (1) performance improves progressively as we move from unimodal to multimodal, verifying the necessity of integrating speech clues into Chinese NER; (2) our proposed model yields state-of-the-art (SoTA) results on CNERTA, demonstrating its effectiveness. For further research, the annotated dataset is publicly available at http://github.com/DianboWork/CNERTA.

“…As the reception field grows linearly with the number of transformer layers, a large latency is introduced with the strategy. 2) chunk-wise method [27,15] segments the input into small chunks and operates speech recognition on each chunk. However, the accuracy drops as the relationship between different chunks is ignored.…”

Section: Introduction (mentioning)

confidence: 99%
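The chunk-wise method mentioned in the statement above can be sketched as a plain segmentation step. This is only an illustration of the general idea: the function name `split_into_chunks` and the chunk size are hypothetical, and a real streaming recognizer would run an encoder on each chunk rather than just slicing a list.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def split_into_chunks(frames: Sequence[Frame], chunk_size: int) -> List[Sequence[Frame]]:
    """Segment a frame sequence into consecutive, non-overlapping chunks.

    The last chunk may be shorter than chunk_size. No context is shared
    between chunks, which is the limitation the quoted statement points out.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


# Ten dummy frames, processed four at a time (both values are illustrative).
print(split_into_chunks(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Because each chunk is handled independently here, frame 4 never sees frame 3; this is the ignored cross-chunk relationship that the statement associates with the accuracy drop.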

Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset

Xie, Wang et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue to prevent their applications. In this work, we explored the potential of Transformer Transducer (T-T) models for the first pass decoding with low latency and fast speed on a large-scale dataset. We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.

“…The time and space complexity are both reduced to O(T), and the within-chunk computation across time can be parallelized with GPUs. While there has been recent work [18,19,20,21,22] with similar ideas showing that such streaming Transformers achieve competitive performance compared with latency-controlled BiLSTMs [23] or non-streaming Transformers for ASR, it remains unclear how the streaming transformers work for shorter sequence modeling task like wake word detection.…”

Section: Introduction (mentioning)

confidence: 99%
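The O(T) claim in the statement above can be checked with a simple count of attention score pairs. The sketch below assumes the simplest variant, in which each frame attends only to frames inside its own fixed-size chunk; the cited systems may add left context or look-ahead, which changes the constant factor but not the linear growth. The function names and the chunk size of 20 are hypothetical.

```python
def full_attention_pairs(num_frames: int) -> int:
    """Score pairs computed by unrestricted self-attention: T * T."""
    return num_frames * num_frames


def chunked_attention_pairs(num_frames: int, chunk_size: int) -> int:
    """Score pairs when every frame attends only within its own chunk."""
    full_chunks, remainder = divmod(num_frames, chunk_size)
    return full_chunks * chunk_size ** 2 + remainder ** 2


# Compare growth as the sequence length doubles (chunk size is illustrative).
for num_frames in (1000, 2000, 4000):
    print(num_frames,
          full_attention_pairs(num_frames),
          chunked_attention_pairs(num_frames, chunk_size=20))
```

For 1000 frames this gives 1,000,000 pairs for full attention against 20,000 for chunked attention, and doubling the length doubles the chunked count while quadrupling the full one.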

Wake Word Detection with Streaming Transformers

Wang, Povey et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks with their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Transformer is not directly applicable to the task due to its non-streaming nature and the quadratic time and space complexity. In this paper we explore the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including looking-ahead to the next chunk, gradient stopping, different positional embedding methods and adding same-layer dependency between chunks. Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolution network by 25% on average in false rejection rate at the same false alarm rate with a comparable model size, while still maintaining linear complexity w.r.t. the sequence length.
