Joint Contextual Modeling for ASR Correction and Language Understanding

Weng, Yue; Miryala, Sai Sumanth; Khatri, Chandra; Wang, Runze; Zheng, Huaixiu; Molino, Piero; Namazifar, Mahdi; Papangelis, Alexandros; Williams, Hugh E.; Bell, Franziska; Tür, Gökhan

doi:10.1109/icassp40776.2020.9053213

Cited by 42 publications

(31 citation statements)

References 21 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…On the other hand, there has been work on re-scoring ASR nbest by exploring the morphological, lexical, and syntactic features [21,22]. [23] shows a joint model for ASR error correction and language understanding tasks such as dialog act prediction and slot filling. Our work differs in that we are not attempting to correct the ASR error by re-aligning the hypotheses but considering the n-best directly for downstream NLU tasks jointly..…”

Section: Effect Of Multiple Cnnsmentioning

confidence: 99%

ASR N-Best Fusion Nets

Chen

Wanigasekara

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Current spoken language understanding systems heavily rely on the best hypothesis (ASR 1-best) generated by automatic speech recognition, which is used as the input for downstream models such as natural language understanding (NLU) modules. However, the potential errors and misrecognition in ASR 1-best raise challenges to NLU. It is usually difficult for NLU models to recover from ASR errors without additional signals, which leads to suboptimal SLU performance. This paper proposes a fusion network to jointly consider ASR n-best hypotheses for enhanced robustness to ASR errors. Our experiments on Alexa data show that our model achieved 21.71% error reduction compared to baseline trained on transcription for domain classification.

show abstract

Section: Effect Of Multiple Cnnsmentioning

confidence: 99%

ASR N-Best Fusion Nets

Chen

Wanigasekara

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

show abstract

“…The inverse transformation of correcting ASR errors has also been explored, for example in Weng et al (2020).…”

Section: Related Workmentioning

confidence: 99%

Adapting Document-Grounded Dialog Systems to Spoken Conversations using Data Augmentation and a Noisy Channel Model

Thulke¹,

Nico²,

Dugast³

et al. 2021

Preprint

View full text Add to dashboard Cite

This paper summarizes our submission to Task 2 of the second track of the 10th Dialog System Technology Challenge (DSTC10) "Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations". Similar to the previous year's iteration, the task consists of three subtasks: detecting whether a turn is knowledge seeking, selecting the relevant knowledge document and finally generating a grounded response. This year, the focus lies on adapting the system to noisy ASR transcripts. We explore different approaches to make the models more robust to this type of input and to adapt the generated responses to the style of spoken conversations. For the latter, we get the best results with a noisy channel model that additionally reduces the number of short and generic responses. Our best system achieved the 1st rank in the automatic and the 3rd rank in the human evaluation of the challenge.

show abstract

“…In AED, DB can also be applied by combining a biasing vector with the decoder output before passing them to the Softmax output layer, as shown in Eqn. (12).…”

Section: Joint Networkmentioning

confidence: 99%

“…Contextual biasing, which integrates contextual knowledge into an automatic speech recognition (ASR) system, has become increasingly important to many applications [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18]. Contextual knowledge is often represented by a list (referred to as a biasing list) of words or phrases (referred to as biasing words) that are likely to appear in an utterance in a given context.…”

Section: Introductionmentioning

confidence: 99%

Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition

Sun

Zhang

Woodland

2021

Preprint

View full text Add to dashboard Cite

Contextual knowledge is important for real-world automatic speech recognition (ASR) applications. In this paper, a novel tree-constrained pointer generator (TCPGen) component is proposed that incorporates such knowledge as a list of biasing words into both attentionbased encoder-decoder and transducer end-to-end ASR models in a neural-symbolic way. TCPGen structures the biasing words into an efficient prefix tree to serve as its symbolic input and creates a neural shortcut between the tree and the final ASR output distribution to facilitate recognising biasing words during decoding. Systems were trained and evaluated on the Librispeech corpus where biasing words were extracted at the scales of an utterance, a chapter, or a book to simulate different application scenarios. Experimental results showed that TCPGen consistently improved word error rates (WERs) compared to the baselines, and in particular, achieved significant WER reductions on the biasing words. TCPGen is highly efficient: it can handle 5,000 biasing words and distractors and only add a small overhead to memory use and computation cost.

show abstract

Joint Contextual Modeling for ASR Correction and Language Understanding

Cited by 42 publications

References 21 publications

ASR N-Best Fusion Nets

ASR N-Best Fusion Nets

Adapting Document-Grounded Dialog Systems to Spoken Conversations using Data Augmentation and a Noisy Channel Model

Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition

Contact Info

Product

Resources

About