ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054689

Learning ASR-Robust Contextualized Embeddings for Spoken Language Understanding

Abstract: Employing pre-trained language models (LMs) to extract contextualized word representations has achieved state-of-the-art performance on various NLP tasks. However, applying this technique to the noisy transcripts generated by an automatic speech recognizer (ASR) raises concerns. Therefore, this paper focuses on making contextualized representations more ASR-robust. We propose a novel confusion-aware fine-tuning method to mitigate the impact of ASR errors on pre-trained LMs. Specifically, we fine-tune LMs to produce simila…
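The abstract describes the confusion-aware fine-tuning objective only at a high level. Below is a minimal sketch of one plausible reading, assuming the objective adds a term that pulls together the contextual embeddings of acoustically confusable (reference vs. ASR-hypothesis) tokens on top of the standard LM loss; the function name `confusion_aware_loss`, the cosine-distance term, and the weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def confusion_aware_loss(clean_emb, noisy_emb, lm_loss, alpha=1.0):
    """Combine a standard LM loss with a term that pulls contextual
    embeddings of acoustically confusable token pairs closer together.

    clean_emb, noisy_emb: (num_pairs, hidden) embeddings of aligned
    reference / ASR-hypothesis tokens; lm_loss: scalar LM loss.
    """
    # Cosine distance between the paired embeddings of confusable tokens.
    sim = F.cosine_similarity(clean_emb, noisy_emb, dim=-1)
    confusion_term = (1.0 - sim).mean()
    return lm_loss + alpha * confusion_term

# Toy usage with random tensors standing in for an LM's hidden states.
clean = torch.randn(8, 256, requires_grad=True)
noisy = torch.randn(8, 256, requires_grad=True)
lm_loss = torch.tensor(2.3, requires_grad=True)
loss = confusion_aware_loss(clean, noisy, lm_loss, alpha=0.5)
loss.backward()
```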

Cited by 27 publications (37 citation statements) | References 21 publications
“…For a fair comparison, we exclude the model performance of a data augmentation setting. Note that the experiment settings for the SmartLights and Snips datasets are not exactly the same as those of Huang and Chen [28] due to the given data condition described in Section 4.1. Nevertheless, we present both results to show that our model is competitive.…”
Section: Results (mentioning, confidence: 99%)
“…Previous SLU studies used this dataset by synthesizing audio from text data, and we follow the same experimental protocol. To compare our result with that of Huang and Chen, we used the Google text-to-speech (TTS) system, the same as in [28], though this does not guarantee that our data are identical to their audio. The SmartLights dataset [29] consists of 1,660 spoken commands for a smart light assistant with 6 unique intents.…”
Section: Methods (mentioning, confidence: 99%)
“…SNIPS is an NLU benchmark, so it provides only text utterances and their corresponding labels. We generate speech data with Google's commercial speech synthesis toolkit, similar to [22], to use SNIPS for SLU evaluation. We use a single-speaker option by selecting the basic voice type named en-US-Standard-B.…”
Section: Dataset (mentioning, confidence: 99%)
“…We use a single-speaker option by selecting the basic voice type named en-US-Standard-B. Because other works [22, 23] use their own speech synthesis methods and do not mention the exact details needed to reproduce them, a fair comparison between them and ours is impossible.…”
Section: Dataset (mentioning, confidence: 99%)
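Several of the citing papers above synthesize SNIPS audio with Google's TTS using the en-US-Standard-B voice. A minimal sketch with the google-cloud-texttospeech Python client is shown below; the 16 kHz sample rate, LINEAR16 encoding, and output filename are assumptions, since, as the last quote notes, the cited works do not specify their exact synthesis settings.

```python
# pip install google-cloud-texttospeech  (requires Google Cloud credentials)
from google.cloud import texttospeech

def synthesize(text, out_path="utterance.wav"):
    """Render one text utterance with the en-US-Standard-B voice."""
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-US", name="en-US-Standard-B"
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,  # assumed ASR-friendly rate, not specified in the cited works
        ),
    )
    # Write the returned audio bytes to disk.
    with open(out_path, "wb") as f:
        f.write(response.audio_content)

if __name__ == "__main__":
    synthesize("Turn on the lights in the kitchen")
```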