Top-Down Attention in End-to-End Spoken Language Understanding

Chen, Yixin; Lu, Weiyi; Mottini, Alejandro; Li, Li Erran; Droppo, Jasha; Du, Zheng; Zeng, Belinda

doi:10.1109/icassp39728.2021.9414313

Cited by 7 publications

(3 citation statements)

References 16 publications


“…Furthermore, long dwell times testify to the level of top-down attention, i.e., the attention driven by what the participant already knows. Language, and specifically the semantics of classifier phrases, represents one kind of knowledge that can drive top-down attention (Baluch and Itti, 2011; Chen et al, 2021). Finally, we think the relevance of dwell times is particularly apparent in a Visual World Paradigm experiment, since they correlate with situational awareness (Hauland and Duijm, 2002), and indicate that participants refrain from looking at contextually irrelevant stimuli (Mohanty and Sussman, 2013).…”

Section: Discussion (mentioning)

confidence: 97%

Tracking semantic relatedness: numeral classifiers guide gaze to visual world objects

Lobben, Bochynska, Eifring et al. 2023. Front. Lang. Sci.


Directing visual attention toward items mentioned within utterances can optimize understanding of the unfolding spoken language and the preparation of appropriate behaviors. In several languages, numeral classifiers specify semantic classes of nouns but can also function as reference trackers. Whereas all classifier types function to single out objects for reference in the real world and may assist attentional guidance, we propose that only sortal classifiers efficiently guide visual attention, because they are inherently attached to the nouns' semantics. By contrast, container classifiers are pragmatically attached to the nouns they classify, and default classifiers index a noun without specifying its semantics. Using eye tracking and the “visual world paradigm”, we had Chinese speakers (N = 20) listen to sentences and observed that they looked spontaneously at the target objects within 150 ms after offset of the sortal classifier. After about 200 ms the same occurred for the container classifiers, but with the default classifier only after about 700 ms. This looking pattern was absent in a control group of non-Chinese speakers, and the Chinese speakers' gaze behavior can therefore only be ascribed to classifier semantics and not to artifacts of the visual objects. Thus, we found that classifier types affect the rapidity of spontaneously looking at the target objects on a screen. These significantly different latencies indicate that the stronger the semantic relatedness between a classifier and its noun, the more efficient the deployment of overt attention.


“…An end-to-end (E2E) speech processing system leverages a single model which takes the input speech and performs spoken language processing tasks simultaneously. E2E models draw increasing attention due to less computational complexity and error propagation mitigation (Tian and Gorinski, 2020; Sharma et al, 2021; Lugosch et al, 2020; Wang et al, 2020; Chen et al, 2021b). However, a challenge of E2E model training is the collection of enormous annotated spoken data, which are significantly more expensive to collect compared with the text-only counterpart.…”

Section: Introduction (mentioning)

confidence: 99%

Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis

Lu, Huang, Zheng et al. 2023. Findings of the Association for Computational Linguistics: EMNLP 2023


Training a high-performance end-to-end (E2E) speech processing model requires an enormous amount of labeled speech data, especially in the era of data-centric artificial intelligence. However, labeled speech data are usually scarcer and more expensive to collect than textual data. We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models. We train a latent synthesizer to convert textual data into an intermediate latent representation of a pre-trained speech model. These pseudo acoustic representations of textual data augment acoustic data for model training. We evaluate LaSyn on low-resource automatic speech recognition (ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an E2E baseline trained on LibriSpeech train-clean-100, with relative word error rate reductions over 22.3% on different test sets. For SLU, LaSyn improves our E2E baseline by absolute 4.1% for intent classification accuracy and 3.8% for slot filling SLU-F1 on SLURP, and absolute 4.49% and 2.25% for exact match (EM) and EM-Tree accuracies on STOP respectively. With fewer parameters, the results of LaSyn are competitive with published state-of-the-art works. The results demonstrate the quality of the augmented training data.


“…Deep, end-to-end models [3][4][5][6][7][8] are adopted for these complicated tasks due to advancements in model architectures and computing capabilities. End-to-end architectures typically outperform traditional, modular architectures without requiring domain expertise or feature engineering [9].…”

Section: Introduction (mentioning)

confidence: 99%

Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding

Arora, Ostapenko, Viswanathan et al. 2021. Preprint


Decomposable tasks are complex and comprise a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips SmartLights datasets. Each split has two test sets: one with held-out utterances assessing natural language understanding abilities, and one with held-out speakers to test speech processing skills. Our splits identify performance gaps up to 10% between end-to-end systems that were within 1% of each other on the original test sets. These performance gaps allow more realistic and actionable comparisons between different architectures, driving future model development. We release our splits and tools for the community.
