End-to-End Neural Transformer Based Spoken Language Understanding

Radfar, Martin; Mouchtaris, Athanasios; Kunzmann, Siegfried

doi:10.48550/arxiv.2008.10984

Cited by 5 publications

(8 citation statements)

References 0 publications

“…E2E ASR is implemented in ESPnet, where it has 12 Transformer encoder layers and 6 decoder layers. The choice of the Transformer is similar to [16]. E2E ASR is optimized with hybrid CTC/attention losses [30] with label smoothing.…”

Section: Methods (mentioning)

confidence: 99%

“…However, these NLU works [10,12,13] usually ignore ASR or require an off-the-shelf ASR during testing. A line of E2E SLU work does take speech as input, yet it frames slots as intents and therefore their SLU models are really designed for IC only [8,9,14,15,16]. Another line of E2E SLU work jointly predicts text and IC/SL from speech, yet it either requires large amounts of in-house data, or restricts the pretraining scheme to ASR subword prediction [7,17,18,19].…”

Section: Introduction (mentioning)

confidence: 99%

Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining

Lai, Chuang, Lee et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Much recent work on Spoken Language Understanding (SLU) is limited in at least one of three ways: models were trained on oracle text input and neglected ASR errors, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. In this paper, we propose a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech to address these issues. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU data. We study two semi-supervised settings for the ASR component: supervised pretraining on transcribed speech, and unsupervised pretraining by replacing the ASR encoder with self-supervised speech representations, such as wav2vec. In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation. Experiments on ATIS show that our SLU framework with speech as input can perform on par with those using oracle text as input in semantics understanding, even though environmental noise is present and a limited amount of labeled semantics data is available for training.

“…E2E ASR is implemented in ESPnet [65], where it has 12 Transformer encoder layers and 6 decoder layers. The choice of the Transformer architecture [60] is due to its empirical successes in [33] and concurrent SLU work [48]. The E2E ASR is trained with hybrid CTC/attention loss [64] (CTC weight is 0.3, attention weight is 0.7) with label smoothing.…”

Section: Methods (mentioning)

confidence: 99%
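The weighting quoted above (CTC weight 0.3, attention weight 0.7, with label smoothing) can be illustrated with a minimal sketch. This is not ESPnet's code: the function names are hypothetical, the smoothing value of 0.1 is an assumption (the quote does not state it), and the uniform-over-all-classes smoothing shown here is only one common variant.

```python
import math


def hybrid_ctc_attention_loss(ctc_loss: float, att_loss: float,
                              ctc_weight: float = 0.3) -> float:
    """Interpolate two per-utterance losses.

    With ctc_weight = 0.3 the attention branch gets weight 0.7,
    matching the weights reported in the quoted passage.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def label_smoothed_nll(log_probs: list[float], target: int,
                       smoothing: float = 0.1) -> float:
    """Cross-entropy against a smoothed target distribution.

    Assumed variant: the true class keeps (1 - smoothing) and the
    smoothing mass is spread uniformly over all classes.
    """
    n = len(log_probs)
    uniform = smoothing / n
    loss = 0.0
    for i, lp in enumerate(log_probs):
        weight = uniform + ((1.0 - smoothing) if i == target else 0.0)
        loss -= weight * lp
    return loss


# Toy usage: a 3-class decoder step and a placeholder CTC loss value.
step_log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
att = label_smoothed_nll(step_log_probs, target=0)
total = hybrid_ctc_attention_loss(ctc_loss=2.0, att_loss=att)
```

In a real system both losses would come from the model's CTC and attention-decoder branches over whole sequences; the scalars here only show how the two terms are combined.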

Towards Semi-Supervised Semantics Understanding from Speech

Lai, Cao, Bodapati et al. 2020

Preprint

Much recent work on Spoken Language Understanding (SLU) falls short in at least one of three ways: models were trained on oracle text input and neglected the Automatic Speech Recognition (ASR) outputs, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. We proposed a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed speech to address these issues. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU corpus. In parallel, we identified two inadequate settings under which SLU models have been tested: noise-robustness and E2E semantics evaluation. We tested the proposed framework under realistic environmental noises and with a new metric, the slots edit F1 score, on two public SLU corpora. Experiments show that our SLU framework with speech as input can perform on par with those with oracle text as input in semantics understanding, while environmental noises are present, and a limited amount of labeled semantics data is available.

* Work performed during an internship at Amazon AI.
† Corresponding author.
3 SLU typically consists of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). ASR maps audio to text, and NLU maps text to semantics. Here, we are interested in learning a mapping directly from raw audio to semantics.
4 Semantics is commonly formulated as intent and slots in common benchmarking datasets like ATIS.

“…A common practice is to convert normalized token sequence in spoken form produced by ASR into a written form better suited to processing by downstream components in dialog systems [15]. This written form is then used to extract structured information in the form of intent and slot-values to continue a dialog [16].…”

Section: Related Work (mentioning)

confidence: 99%

Seq-2-Seq based Refinement of ASR Output for Spoken Name Capture

Singla, Jalalvand, Kim et al. 2022

Interspeech 2022

This paper reimagines some aspects of speech processing using speech encoders, specifically about extracting entities directly from speech, with no intermediate textual representation. In human-computer conversations, extracting entities such as names, postal addresses and email addresses from speech is a challenging task. In this paper, we study the impact of fine-tuning pre-trained speech encoders on extracting spoken entities in human-readable form directly from speech without the need for text transcription. We illustrate that such a direct approach optimizes the encoder to transcribe only the entity relevant portions of speech, ignoring the superfluous portions such as carrier phrases and spellings of entities. In the context of dialogs from an enterprise virtual agent, we demonstrate that the 1-step approach outperforms the typical 2-step cascade of first generating lexical transcriptions followed by text-based entity extraction for identifying spoken entities.
