Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.244

Textual Supervision for Visually Grounded Spoken Language Understanding

Abstract: Visually-grounded models of spoken language understanding extract semantic information directly from speech, without relying on transcriptions. This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain. Recent work showed that these models can be improved if transcriptions are available at training time. However, it is not clear how an end-to-end approach compares to a traditional pipeline-based approach when one has access to transcriptions. Comparing different strategies, we find that the pipeline approach works better when enough text is available.
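To make the contrast in the abstract concrete, below is a minimal PyTorch sketch of the end-to-end variant: a visually grounded speech encoder trained jointly with auxiliary textual supervision (a CTC-style ASR head alongside a speech-image grounding loss). All names, layer sizes, the toy data, and the 0.5 loss weight are illustrative assumptions, not the authors' implementation; the pipeline alternative would instead train the ASR module separately and ground its text output with a text-based model.

```python
# Hedged sketch (not the paper's code): end-to-end multi-task training of a
# grounded-speech model with auxiliary textual supervision via CTC.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedSpeechModel(nn.Module):
    """Speech encoder grounded in images, with an auxiliary ASR head."""
    def __init__(self, n_mel=40, hidden=256, embed=512, vocab=32):
        super().__init__()
        self.rnn = nn.GRU(n_mel, hidden, batch_first=True, bidirectional=True)
        self.to_embed = nn.Linear(2 * hidden, embed)   # joint speech-image space
        self.asr_head = nn.Linear(2 * hidden, vocab)   # textual supervision head

    def forward(self, mel):                  # mel: (B, T, n_mel)
        states, _ = self.rnn(mel)            # (B, T, 2*hidden)
        speech_emb = F.normalize(self.to_embed(states.mean(dim=1)), dim=-1)
        char_logits = self.asr_head(states)  # per-frame character logits
        return speech_emb, char_logits

def grounding_loss(speech_emb, image_emb, margin=0.2):
    """Hinge loss pushing matched speech/image pairs above mismatched ones."""
    sims = speech_emb @ image_emb.t()                  # (B, B) similarity matrix
    pos = sims.diag().unsqueeze(1)                     # matched-pair similarities
    cost = (margin + sims - pos).clamp(min=0)          # violations by impostors
    return (cost.sum() - cost.diag().sum()) / (sims.numel() - len(sims))

# Toy batch: 4 utterances of 100 frames, precomputed image embeddings,
# and character targets (values 1..31; 0 is the CTC blank).
mel = torch.randn(4, 100, 40)
image_emb = F.normalize(torch.randn(4, 512), dim=-1)
targets = torch.randint(1, 32, (4, 12))

model = GroundedSpeechModel()
speech_emb, char_logits = model(mel)

# Multi-task objective: grounding loss plus CTC-based textual supervision,
# weighted by an assumed hyperparameter (0.5 here).
ctc = nn.CTCLoss(blank=0)
log_probs = char_logits.log_softmax(-1).transpose(0, 1)   # (T, B, vocab)
loss = grounding_loss(speech_emb, image_emb) + 0.5 * ctc(
    log_probs, targets,
    input_lengths=torch.full((4,), 100), target_lengths=torch.full((4,), 12))
loss.backward()
print(float(loss))
```

Dropping the CTC term recovers a purely speech-image model with no textual supervision, which is one axis of the comparison the abstract describes.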

Cited by 9 publications (13 citation statements: 0 supporting, 13 mentioning, 0 contrasting).
References 26 publications.
“…Pasad, Shi, Kamper, and Livescu (2019) use a very similar approach while specifically focusing on low-resource settings and testing the effect of varying the amount of textual supervision. Higy, Elliott, and Chrupała (2020) investigate two forms of textual supervision in low-resource settings: transcriptions, and text translations. They also compare the multi-task learning approaches to simple pipeline architectures where text transcriptions are used to train an ASR module, and find that in most cases the pipeline is hard to improve on.…”
Section: Auxiliary Textual Supervision (mentioning)
confidence: 99%
“…Pasad et al. (2019) use a very similar approach while specifically focusing on low-resource settings and testing the effect of varying the amount of textual supervision. Higy et al. (2020) investigate two forms of textual supervision in low-resource settings: transcriptions, and text translations. They also compare the multi-task learning approaches to simple pipeline architectures and find that in most cases the pipeline is hard to improve on.…”
Section: Variants and Applications (mentioning)
confidence: 99%
“…A number of works address cognitive and linguistic questions, such as understanding how different learned layers correspond to visual stimuli [3,4], learning linguistic units [5,6], or how visually grounded representations and data can help understand lexical competition in phonemic processing [7]. Other work addresses applied tasks, including multimodal retrieval [8,9,10], predicting written keywords given speech and image inputs [11], cross-modality alignment [12], retrieving speech in different languages using images as a pivot modality [13,14,15], and speech-to-speech retrieval [14,15].…”
Section: Freeing Speech (mentioning)
confidence: 99%
“…More recent work suggests that this same architecture can be used for many tasks. However, only a few studies have specifically focused on improving retrieval [8,9], and no one has systematically evaluated the effectiveness of different design choices on multiple datasets.…”
Section: Corralling Multiple Multimodal Choices (mentioning)
confidence: 99%