ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414313

Top-Down Attention in End-to-End Spoken Language Understanding

Abstract: Spoken language understanding (SLU) is the task of inferring the semantics of spoken utterances. Traditionally, this has been achieved with a cascading combination of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) modules that are optimized separately, which can lead to a suboptimal overall performance. More recently, End-to-End SLU (E2E SLU) was proposed to perform SLU directly from speech through a joint optimization of the modules, addressing some of the traditional SLU shortcom…

Cited by 7 publications (3 citation statements)
References 16 publications
“…Furthermore, long dwell times testify to the level of top-down attention, i.e., the attention driven by what the participant already knows. Language, and specifically the semantics of classifier phrases, represents one kind of knowledge that can drive top-down attention (Baluch and Itti, 2011; Chen et al, 2021). Finally, we think the relevance of dwell times is particularly apparent in a Visual World Paradigm experiment, since they correlate with situational awareness (Hauland and Duijm, 2002), and indicate that participants refrain from looking at contextually irrelevant stimuli (Mohanty and Sussman, 2013).…”
Section: Discussion (mentioning)
confidence: 97%
“…An end-to-end (E2E) speech processing system leverages a single model which takes the input speech and performs spoken language processing tasks simultaneously. E2E models draw increasing attention due to less computational complexity and error propagation mitigation (Tian and Gorinski, 2020; Sharma et al, 2021; Lugosch et al, 2020; Wang et al, 2020; Chen et al, 2021b). However, a challenge of E2E model training is the collection of enormous annotated spoken data, which are significantly more expensive to collect compared with the text-only counterpart.…”
Section: Introduction (mentioning)
confidence: 99%
“…Deep, end-to-end models [3][4][5][6][7][8] are adopted for these complicated tasks due to advancements in model architectures and computing capabilities. End-to-end architectures typically outperform traditional, modular architectures without requiring domain expertise or feature engineering [9].…”
Section: Introduction (mentioning)
confidence: 99%