ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9054314
End-to-End Architectures for ASR-Free Spoken Language Understanding

Abstract: Spoken Language Understanding (SLU) is the problem of extracting the meaning from speech utterances. It is typically addressed as a two-step problem, where an Automatic Speech Recognition (ASR) model is employed to convert speech into text, followed by a Natural Language Understanding (NLU) model to extract meaning from the decoded text. Recently, end-to-end approaches have emerged, aiming at unifying the ASR and NLU into a single SLU deep neural architecture, trained using combinations of ASR and NLU-level re…
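To make the contrast concrete, the end-to-end setup described in the abstract can be sketched as a model that maps acoustic features directly to an intent distribution, with no intermediate transcript. This is a minimal illustrative sketch with a hypothetical single linear layer standing in for the deep encoder; feature dimensions and the number of intents are assumptions, not values from the paper.

```python
import numpy as np

def e2e_intent_classifier(features, W, b):
    """Map a speech-feature matrix (frames x dims) directly to intent
    probabilities, bypassing any intermediate text transcript."""
    pooled = features.mean(axis=0)        # average-pool over time frames
    logits = pooled @ W + b               # placeholder encoder: one linear layer
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
feats = rng.standard_normal((120, 40))   # 120 frames of 40-dim features (assumed)
W = rng.standard_normal((40, 3))         # 3 hypothetical intent classes
b = np.zeros(3)
probs = e2e_intent_classifier(feats, W, b)
```

In a cascaded system, the same input would first pass through an ASR decoder and the NLU model would see only the text; here the gradient can flow from the intent loss all the way back to the acoustic frontend.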

Cited by 21 publications (14 citation statements)
References 19 publications
“…[22] has a comparable performance to ours (1.9% vs 2.18%), similar to the work of Palogiannidi et al [23], which has performance between 1.38% and 5.83% depending on the number and the types of the recurrent layers. Our approach is more appropriate for low-power devices since it is built from only convolutional layers, whereas the models of [6,22,23] include many recurrent layers, which are slower compared to convolutional layers [9][10][11]. Figure 2 shows the validation losses for the three models trained, respectively, with global CMVN, utterance CMVN and without any CMVN (No-CMVN).…”
Section: The Proposed Network Architecture (supporting)
confidence: 89%
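The excerpt above compares global versus utterance-level CMVN (cepstral mean and variance normalization). As a minimal sketch of the utterance-level variant, each feature dimension is standardized using statistics computed over that utterance's frames alone (the epsilon and matrix shapes here are illustrative assumptions):

```python
import numpy as np

def utterance_cmvn(features, eps=1e-8):
    """Utterance-level CMVN: zero-mean, unit-variance normalization of each
    feature dimension, using only this utterance's own frames."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)

rng = np.random.default_rng(1)
utt = 3.0 * rng.standard_normal((200, 40)) + 5.0  # frames x feature dims (assumed)
norm = utterance_cmvn(utt)
```

Global CMVN would instead reuse a single mean/variance pair estimated over the whole training corpus, which is cheaper at inference time but less adaptive to per-speaker and per-channel variation.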
“…SLU systems have traditionally been a cascade of an automatic speech recognition (ASR) system converting speech into text followed by a natural language understanding (NLU) system that interprets the meaning of the text [1][2][3][4]. In contrast, an end-to-end (E2E) SLU system [5][6][7][8][9][10][11][12][13][14] processes speech input directly into meaning without going through an intermediate text transcript.…”
Section: Introduction (mentioning)
confidence: 99%
“…Rather than containing discrete ASR and NLU modules, E2E SLU models are trained to infer the utterance semantics directly from the spoken signal [13][14][15][16][17][18][19][20]. These models are trained to maximize the SLU prediction accuracy, where the predicted semantic targets vary from just the intent [21,22] to a full interpretation with domain, intents, and slots [13].…”
Section: Introduction (mentioning)
confidence: 99%