Learning ASR-Robust Contextualized Embeddings for Spoken Language Understanding

Huang, Chao‐Wei; Chen, Yun-Nung

doi:10.1109/icassp40776.2020.9054689

Cited by 27 publications

(37 citation statements)

References 21 publications

“…For a fair comparison, we exclude the model performance of a data augmentation setting. Note that the experiment settings for Smartlights and Snips datasets are not exactly same as that of Huang and Chen [28] due to the given data condition described in 4.1. Nevertheless, we present both results to show that our model is competitive.…”

Section: Results (mentioning)

confidence: 99%

“…Previous SLU studies used this dataset by synthesizing audio from text data, and we follow the same experimental protocol. To compare our result with the result of Huang and Chen, we used Google text-to-speech (TTS) system the same as in [28], though it does not guarantee that our data are identical to their audio. SmartLights dataset [29] consists of 1,660 spoken commands for a smart light assistant with 6 unique intents.…”

Section: Methods (mentioning)

confidence: 99%

“…For each utterance, there are two audio types where the microphone setting is different, close field and far field and the latter setting is more challenging. To evaluate our model on this dataset, we use 10-fold cross-validation as suggested in [28].…”

Section: Methods (mentioning)

confidence: 99%

ST-BERT: Cross-Modal Language Model Pre-Training for End-to-End Spoken Language Understanding

Kim¹, Kim², Lee³, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Language model pre-training has shown promising results in various downstream tasks. In this context, we introduce a cross-modal pre-trained language model, called Speech-Text BERT (ST-BERT), to tackle end-to-end spoken language understanding (E2E SLU) tasks. Taking phoneme posterior and subword-level text as an input, ST-BERT learns a contextualized cross-modal alignment via our two proposed pre-training tasks: Cross-modal Masked Language Modeling (CM-MLM) and Cross-modal Conditioned Language Modeling (CM-CLM). Experimental results on three benchmarks present that our approach is effective for various SLU datasets and shows a surprisingly marginal performance degradation even when 1% of the training data are available. Also, our method shows further SLU performance gain via domain-adaptive pre-training with domain-specific speech-text pair data.

“…SNIPS is an NLU benchmark, so they only provide text utterances and their corresponding labels. We generate speech data by Google's commercial speech synthesis toolkit 1 similar to [22] to use SNIPS for SLU evaluation. We use a single speaker option by setting as a basic voice type named en-US-Standard-B.…”

Section: Dataset (mentioning)

confidence: 99%

“…We use a single speaker option by setting as a basic voice type named en-US-Standard-B. Because other works [22,23] use their own speech synthesis methods and does not mention exact details to reproduce, a fair comparison between them and ours is impossible.…”

Section: Dataset (mentioning)

confidence: 99%

Two-Stage Textual Knowledge Distillation for End-to-End Spoken Language Understanding

Kim, Shin, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

End-to-end approaches open a new way for more accurate and efficient spoken language understanding (SLU) systems by alleviating the drawbacks of traditional pipeline systems. Previous works exploit textual information for an SLU model via pre-training with automatic speech recognition or fine-tuning with knowledge distillation. To utilize textual information more effectively, this work proposes a two-stage textual knowledge distillation method that matches utterance-level representations and predicted logits of two modalities during pre-training and fine-tuning, sequentially. We use vq-wav2vec BERT as a speech encoder because it captures general and rich features. Furthermore, we improve the performance, especially in a low-resource scenario, with data augmentation methods by randomly masking spans of discrete audio tokens and contextualized hidden representations. Consequently, we push the state-of-the-art on the Fluent Speech Commands, achieving 99.7% test accuracy in the full dataset setting and 99.5% in the 10% subset setting. Throughout the ablation studies, we empirically verify that all used methods are crucial to the final performance, providing the best practice for spoken language understanding. Code is available at https://github.com/clovaai/textual-kd-slu.

DDR-ECC: Dictionary-Driven Chinese ASR Entity Correction with Controllable Decoding

Wang 2024

Lecture Notes in Computer Science
