History Utterance Embedding Transformer LM for Speech Recognition

Deng, Keqi; Cheng, Gaofeng; Miao, Haoran; Zhang, Pengyuan; Yan, Yonghong

doi:10.1109/icassp39728.2021.9414575

Cited by 5 publications

(3 citation statements)

References 22 publications

“…This method is also used in GPT2 model compression. Language models are widely used in ASR task [46]. Combining LM with an end-to-end ASR model is common through shallow fusion [47] or cold fusion [48].…”

Section: Related Work (mentioning)

confidence: 99%

Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition

Ren, Yolwas, Slamu et al. 2022

Sensors

Unlike the traditional model, the end-to-end (E2E) ASR model does not require speech information such as a pronunciation dictionary, and its system is built through a single neural network and obtains performance comparable to that of traditional methods. However, the model requires massive amounts of training data. Recently, hybrid CTC/attention ASR systems have become more popular and have achieved good performance even under low-resource conditions, but they are rarely used in Central Asian languages such as Turkish and Uzbek. We extend the dataset by adding noise to the original audio and using speed perturbation. To develop the performance of an E2E agglutinative language speech recognition system, we propose a new feature extractor, MSPC, which uses different sizes of convolution kernels to extract and fuse features of different scales. The experimental results show that this structure is superior to VGGnet. In addition to this, the attention module is improved. By using the CTC objective function in training and the BERT model to initialize the language model in the decoding stage, the proposed method accelerates the convergence of the model and improves the accuracy of speech recognition. Compared with the baseline model, the character error rate (CER) and word error rate (WER) on the LibriSpeech test-other dataset increases by 2.42% and 2.96%, respectively. We apply the model structure to the Common Voice—Turkish (35 h) and Uzbek (78 h) datasets, and the WER is reduced by 7.07% and 7.08%, respectively. The results show that our method is close to the advanced E2E systems.

“…LM plays an important part in ASR [23]. Previous works like shallow fusion [24] and cold fusion [25] aim to combine an auto-regressive LM with a S2S ASR model, which is randomly initialized.…”

Section: Related Work (mentioning)

confidence: 99%

Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model

Deng, Cao, Zhang et al. 2021

Preprint

Self Cite

Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). However, the dominant sequence-to-sequence (S2S) E2E model is still hard to fully utilize the self-supervised pretraining methods because its decoder is conditioned on acoustic representation thus cannot be pretrained separately. In this paper, we propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models to fully utilize the pretrained acoustic models (AMs) and language models (LMs). In our framework, the encoder is initialized with a pretrained AM (wav2vec2.0). The Preformer leverages CTC as an auxiliary task during training and inference. Furthermore, we design a one-cross decoder (OCD), which relaxes the dependence on acoustic representations so that it can be initialized with pretrained LM (DistilGPT2). Experiments are conducted on the AISHELL-1 corpus and achieve a 4.6% character error rate (CER) on the test set. Compared with our vanilla hybrid CTC/attention Transformer baseline, our proposed CTC/attention-based Preformer yields 27% relative CER reduction. To the best of our knowledge, this is the first work to utilize both pretrained AM and LM in a S2S ASR system.

“…To the best of our knowledge, for Mandarin Chinese, there is no public-available dialog speech dataset adequate for the current requirement of high quality. With the boom in popularity of voice-driven interfaces to devices recently, some works [22,23] concerned with communication scenes have been conducted. However, exploring speech processing techniques in dialog scenarios is still challenging.…”

Section: Introduction (mentioning)

confidence: 99%

Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset

Zhang, Chen, Li et al. 2022

Preprint

Self Cite

This paper introduces a high-quality rich annotated Mandarin conversational (RAMC) speech dataset called MagicData-RAMC. The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones with a sampling rate of 16 kHz. The dialogs in MagicData-RAMC are classified into 15 diversified domains and tagged with topic labels, ranging from science and technology to ordinary life. Accurate transcription and precise speaker voice activity timestamps are manually labeled for each sample. Speakers' detailed information is also provided. As a Mandarin speech dataset designed for dialog scenarios with high quality and rich annotations, MagicData-RAMC enriches the data diversity in the Mandarin speech community and allows extensive research on a series of speech-related tasks, including automatic speech recognition, speaker diarization, topic detection, keyword search, text-to-speech, etc. We also conduct several relevant tasks and provide experimental results to help evaluate the dataset.
