2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461785

Towards End-to-end Spoken Language Understanding

Abstract: A spoken language understanding system is traditionally designed as a pipeline of components. First, the audio signal is processed by an automatic speech recognizer to produce a transcription or n-best hypotheses. From the recognition results, a natural language understanding system classifies the text into structured data such as domain, intent and slots for downstream consumers, such as dialog systems and hands-free applications. These components are usually developed and optimized independently. In this paper, we …
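To make the two-stage design described in the abstract concrete, here is a minimal Python sketch of a pipeline SLU system; the function names, stub outputs and slot labels are hypothetical placeholders for illustration, not the paper's implementation (which replaces this pipeline with a single end-to-end model).

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class NLUResult:
    domain: str
    intent: str
    slots: Dict[str, str]

def recognize_speech(audio: bytes) -> List[str]:
    """ASR stage: return n-best transcription hypotheses (stubbed)."""
    return ["play jazz music in the kitchen"]

def understand(text: str) -> NLUResult:
    """NLU stage: map text to domain/intent/slots (stubbed for illustration)."""
    return NLUResult(domain="music", intent="play_music",
                     slots={"genre": "jazz", "location": "kitchen"})

def pipeline_slu(audio: bytes) -> NLUResult:
    """Traditional pipeline: ASR followed by NLU, each developed and tuned separately."""
    hypotheses = recognize_speech(audio)
    return understand(hypotheses[0])  # downstream consumers only see the text

# An end-to-end system instead maps the audio signal directly to the structured
# output (domain/intent/slots), so both stages can be optimized jointly.
if __name__ == "__main__":
    print(pipeline_slu(b"\x00" * 16000))
```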

Cited by 190 publications (203 citation statements) | References 21 publications
“…Another learning problem may be introduced by the very different lengths of the input sequence (speech spectrograms) and the gold output sequence (characters, tokens or concepts). Let the input sequence have length N and the output sequence have length M; in general N ≫ M (in our data we found that N/M ≤ 30; see Table 2, ASR results on MEDIA, where (*) is a character error rate). When the decoder is at processing step i, it has no information on which spectrogram frames to use as input. This problem can be solved using an attention mechanism [18] to focus on the correct part of the input sequence depending on the part of the output sequence being decoded.…”
Section: Incremental Training Strategy
confidence: 99%
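The quoted passage resolves the N ≫ M length mismatch with an attention mechanism [18]. As a generic illustration of that idea (not the specific variant used in the cited work), a single dot-product attention step in PyTorch, with assumed batch size and dimensions, could look like this:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_state, encoder_frames):
    """
    decoder_state:  (batch, d)      decoder hidden state at step i
    encoder_frames: (batch, N, d)   encoded spectrogram frames, N >> M
    Returns a context vector (batch, d) and attention weights (batch, N).
    """
    # Score every encoder frame against the current decoder state.
    scores = torch.bmm(encoder_frames, decoder_state.unsqueeze(-1)).squeeze(-1)  # (batch, N)
    weights = F.softmax(scores, dim=-1)                                          # where to look
    # Weighted sum of frames: the decoder reads only the relevant part of the input.
    context = torch.bmm(weights.unsqueeze(1), encoder_frames).squeeze(1)         # (batch, d)
    return context, weights

# Toy usage with assumed sizes: 1 utterance, N = 300 frames, d = 64.
enc = torch.randn(1, 300, 64)
dec = torch.randn(1, 64)
ctx, attn = dot_product_attention(dec, enc)
print(ctx.shape, attn.shape)  # torch.Size([1, 64]) torch.Size([1, 300])
```

At each decoding step the softmax weights select which of the N spectrogram frames contribute to the context vector, which is what lets the decoder align M output symbols against a much longer input.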
“…Nowadays there is a growing research interest in end-to-end systems for various SLU tasks [23][24][25][26][27][28][29][30][31]. In this work, similarly to [26,29], end-to-end training of signal-to-concept models is performed through the recurrent neural network (RNN) architecture and the connectionist temporal classification (CTC) loss function [32] as shown in Figure 1.…”
Section: End-to-end Signal-to-concept Neural Architecture
confidence: 99%
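For readers unfamiliar with the setup this quote refers to, below is a minimal sketch of RNN-based signal-to-concept training with PyTorch's built-in nn.CTCLoss; the layer sizes, label inventory and random batch are assumptions for illustration, not the configuration used in [26,29].

```python
import torch
import torch.nn as nn

class SignalToConcept(nn.Module):
    """Minimal RNN encoder mapping acoustic frames to per-frame concept-label scores."""
    def __init__(self, n_feats=40, n_hidden=128, n_labels=50):  # label 0 = CTC blank
        super().__init__()
        self.rnn = nn.GRU(n_feats, n_hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * n_hidden, n_labels)

    def forward(self, x):                      # x: (batch, T, n_feats)
        h, _ = self.rnn(x)
        return self.proj(h).log_softmax(-1)    # (batch, T, n_labels)

model = SignalToConcept()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Toy batch: 2 utterances of 200 frames, target concept sequences of length 12.
x = torch.randn(2, 200, 40)
targets = torch.randint(1, 50, (2, 12))
log_probs = model(x).transpose(0, 1)           # CTCLoss expects (T, batch, n_labels)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 200, dtype=torch.long),
           target_lengths=torch.full((2,), 12, dtype=torch.long))
loss.backward()                                # end-to-end gradient from concepts to signal
print(float(loss))
```

The CTC objective marginalizes over all frame-to-label alignments, so no frame-level annotation of the concepts is required.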
“…Spoken Language Understanding (SLU) systems are core components of voice agents such as Apple's Siri, Amazon Alexa and Google Assistant, and can be designed in one of several ways, such as an end-to-end modeling scheme [1] or a collection of task-specific classifiers [2,3]. For a complex SLU system, the machine learning architecture can be computationally expensive, posing a challenge for applications such as On-Device-SLU.…”
Section: Introduction
confidence: 99%