Towards End-to-end Spoken Language Understanding

Serdyuk, Dmitriy; Wang, Yongqiang; Fuegen, Christian; Kumar, Anuj; Liu, Baiyang; Bengio, Yoshua

doi:10.1109/icassp.2018.8461785

Cited by 190 publications

(203 citation statements)

References 21 publications

Citation statements by type: Supporting, Mentioning (199), Contrasting, Unclassified


“…Another learning problem may be introduced by the very different length between input sequence (speech spectrograms) and gold output sequences (characters, tokens or concepts). Let the input sequence have length N and the output sequence have length M. In general N ≫ M (footnote 2: In our data we found that N/M ≤ 30). When the decoder is at processing step i, it has no information on which spectrogram frames to use as input. This problem can be solved using an attention mechanism [18] to …”

Section: Incremental Training Strategy (mentioning)

confidence: 99%

“…Let the input sequence have length N and the output sequence have length M. In general N ≫ M (footnote 2: In our data we found that N/M ≤ 30). When the decoder is at processing step i, it has no information on which spectrogram frames to use as input. This problem can be solved using an attention mechanism [18] to focus on the correct part of the input sequence depending on the part of the output sequence being decoded. [Table 2: ASR results on MEDIA; (*) is a character error rate]…”

Section: Incremental Training Strategy (mentioning)

confidence: 99%
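
The statement above describes attention as a way for the decoder to pick the relevant frames out of an input that is much longer than the output. The following is a minimal sketch of that idea, not the cited authors' implementation: it assumes plain dot-product scoring (the mechanism in [18] may score frames differently), and the function name `attend` and the toy vectors are hypothetical.

```python
import math


def attend(decoder_state, encoder_frames):
    """Weight each of the N encoder frames for a single decoder step.

    Assumption: dot-product scoring followed by a softmax over frames.
    """
    # One relevance score per input frame.
    scores = [
        sum(d * f for d, f in zip(decoder_state, frame))
        for frame in encoder_frames
    ]
    # Numerically stable softmax over the N frames.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the frames.
    dim = len(encoder_frames[0])
    context = [
        sum(w * frame[k] for w, frame in zip(weights, encoder_frames))
        for k in range(dim)
    ]
    return weights, context


# Toy input: N = 4 frames of dimension 2, one decoder state.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
state = [0.0, 3.0]
weights, context = attend(state, frames)
```

At each decoding step the weights form a distribution over all N frames, so the decoder does not need to know in advance which frames correspond to output position i.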

“…Most of recently proposed end-to-end models are based on sequence-to-sequence architectures. They were initially applied to speech translation [6,7] and then to SLU tasks where the main goal is to extract the domain and user intent from an utterance, together with some semantic slots [2,5].…”

Section: Introduction (mentioning)

confidence: 99%


A Data Efficient End-to-End Spoken Language Understanding Architecture

Dinarelli, Kapoor, Jabaian, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


End-to-end architectures have been recently proposed for spoken language understanding (SLU) and semantic parsing. Based on a large amount of data, those models jointly learn acoustic and linguistic-sequential features. Such architectures give very good results in the context of domain, intent and slot detection, but their application to a more complex semantic chunking and tagging task is less easy. For that, in many cases, models are combined with an external language model to enhance their performance. In this paper we introduce a data efficient system which is trained end-to-end, with no additional, pre-trained external module. One key feature of our approach is an incremental training procedure where acoustic, language and semantic models are trained sequentially one after the other. The proposed model has a reasonable size and achieves competitive results with respect to state-of-the-art while using a small training dataset. In particular, we reach 24.02% Concept Error Rate (CER) on MEDIA/test while training on MEDIA/train without any additional data.


“…Nowadays there is a growing research interest in end-to-end systems for various SLU tasks [23][24][25][26][27][28][29][30][31]. In this work, similarly to [26,29], end-to-end training of signal-to-concept models is performed through the recurrent neural network (RNN) architecture and the connectionist temporal classification (CTC) loss function [32] as shown in Figure 1.…”

Section: End-to-end Signal-to-concept Neural Architecture (mentioning)

confidence: 99%
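
The statement above names the connectionist temporal classification (CTC) loss for signal-to-concept training. As a rough illustration of the alignment convention CTC relies on, the sketch below shows the standard collapsing rule that maps a frame-level label path to an output sequence: merge consecutive repeats, then drop blanks. It covers only this collapsing step, not the loss computation, and the concept labels `city` and `date` are hypothetical.

```python
def ctc_collapse(path, blank="_"):
    """Collapse a frame-level CTC path into its output label sequence."""
    output = []
    previous = None
    for symbol in path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank token.
        if symbol != previous and symbol != blank:
            output.append(symbol)
        previous = symbol
    return output


# Nine frames collapse to two concept labels.
frame_labels = ["_", "city", "city", "_", "_", "date", "date", "date", "_"]
concepts = ctc_collapse(frame_labels)

# A blank between two identical labels keeps them as separate outputs.
repeated = ctc_collapse(["a", "_", "a"])
```

The CTC loss sums the probability of every frame-level path that collapses to the reference sequence, which is why no frame-by-frame alignment is needed in the training data.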

Dialogue History Integration into End-to-End Signal-to-Concept Spoken Language Understanding Systems

Tomashenko, Raymond, Caubrière, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


This work investigates the embeddings for representing dialog history in spoken language understanding (SLU) systems. We focus on the scenario when the semantic information is extracted directly from the speech signal by means of a single end-to-end neural network model. We propose to integrate dialogue history into an end-to-end signal-to-concept SLU system. The dialog history is represented in the form of dialog history embedding vectors (so-called h-vectors) and is provided as additional information to end-to-end SLU models in order to improve the system performance. The following three types of h-vectors are proposed and experimentally evaluated in this paper: (1) supervised-all embeddings predicting the bag-of-concepts expected in the answer of the user from the last dialog system response; (2) supervised-freq embeddings focusing on predicting only a selected set of semantic concepts (corresponding to the most frequent errors in our experiments); and (3) unsupervised embeddings. Experiments on the MEDIA corpus for the semantic slot filling task demonstrate that the proposed h-vectors improve the model performance.

Index Terms: End-to-end models, spoken language understanding (SLU), dialog history, h-vectors, semantic slot filling (SF)


“…Spoken Language Understanding (SLU) systems are core components of voice agents such as Apple's Siri, Amazon Alexa and Google Assistant and can be designed in one of the several ways, such as an end to end modeling scheme [1], or a collection of task specific classifiers [2,3]. For a complex SLU system, the machine learning architecture can be computationally expensive, posing a challenge for applications such as On-Device-SLU.…”

Section: Introduction (mentioning)

confidence: 99%

Fast Intent Classification for Spoken Language Understanding Systems

Tyagi, Sharma, Gupta, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Spoken Language Understanding (SLU) systems consist of several machine learning components operating together (e.g. intent classification, named entity recognition and resolution). Deep learning models have obtained state of the art results on several of these tasks, largely attributed to their better modeling capacity. However, an increase in modeling capacity comes with added costs of higher latency and energy usage, particularly when operating on low complexity devices. To address the latency and computational complexity issues, we explore a BranchyNet scheme on an intent classification task within SLU systems. The BranchyNet scheme, when applied to a high complexity model, adds exit points at various stages in the model, allowing early decision making for a set of queries to the SLU model. We conduct experiments on the Facebook Semantic Parsing dataset with two candidate model architectures for intent classification. Our experiments show that the BranchyNet scheme provides gains in terms of computational complexity without compromising model accuracy. We also conduct analytical studies regarding the improvements in the computational cost, the distribution of utterances that egress from various exit points, and the impact of adding complexity to inference speed and quality.
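
The exit-point idea in this abstract can be sketched as follows. This is an illustrative toy, not the authors' models: it assumes that an exit is taken when the entropy of that exit's class distribution falls below a threshold, and the stage functions, the threshold value and the name `early_exit_predict` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit_predict(features, stages, threshold):
    """Run stages in order and stop at the first confident exit.

    stages: list of (transform, classifier) pairs (hypothetical layout).
    Returns (predicted label, index of the exit that was used).
    """
    hidden = features
    for index, (transform, classifier) in enumerate(stages):
        hidden = transform(hidden)
        probs = classifier(hidden)
        is_last = index == len(stages) - 1
        # Exit early when the distribution is confident (low entropy);
        # the final stage always produces an answer.
        if is_last or entropy(probs) < threshold:
            label = max(range(len(probs)), key=probs.__getitem__)
            return label, index


# Two toy stages over a scalar feature; the second is sharper.
stages = [
    (lambda x: x, lambda h: softmax([h, -h])),
    (lambda x: 4.0 * x, lambda h: softmax([h, -h])),
]
easy = early_exit_predict(3.0, stages, threshold=0.3)
hard = early_exit_predict(-0.2, stages, threshold=0.3)
```

In this toy run the clear-cut input leaves at the first exit, while the ambiguous one falls through to the final stage, which is the source of the average compute savings the abstract describes.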

