Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.242

Fine-Grained Grounding for Multimodal Speech Recognition

Abstract: Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image will make it possible to recover…
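
To make the abstract's distinction concrete, below is a minimal sketch of the kind of region-level fusion it alludes to: audio encoder states cross-attend over per-region image features instead of relying on a single global image vector. This is not the paper's actual model; the module, its names, and all dimensions are illustrative assumptions (Python/PyTorch).

import torch
import torch.nn as nn

class RegionGroundedFusion(nn.Module):
    """Fuse speech encoder states with localized image-region features."""
    def __init__(self, audio_dim=512, region_dim=2048, hidden=512):
        super().__init__()
        self.q = nn.Linear(audio_dim, hidden)    # queries from audio frames
        self.k = nn.Linear(region_dim, hidden)   # keys from image regions
        self.v = nn.Linear(region_dim, hidden)   # values from image regions
        self.out = nn.Linear(audio_dim + hidden, audio_dim)

    def forward(self, audio, regions):
        # audio:   (batch, T, audio_dim)  speech encoder states
        # regions: (batch, R, region_dim) features for R localized regions
        q, k, v = self.q(audio), self.k(regions), self.v(regions)
        att = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        grounded = att @ v                       # (batch, T, hidden)
        # Each audio frame now carries a weighted mix of the regions it
        # attends to, rather than one global image vector for all frames.
        return self.out(torch.cat([audio, grounded], dim=-1))

# Example: 50 audio frames attending over 36 region proposals.
fusion = RegionGroundedFusion()
out = fusion(torch.randn(2, 50, 512), torch.randn(2, 36, 2048))
print(out.shape)  # torch.Size([2, 50, 512])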

Cited by 7 publications (4 citation statements)
References 43 publications

“…Harwath and Glass collected spoken captions for the Flickr8k database and used it to train the first neural network based VGS model [26]. There have been many improvements to the model architecture ([27,28,29,30,31,32,33]) and new applications of VGS models such as semantic keyword spotting ([34,35,14]), image generation [36], recovering of masked speech [37] and even models combining speech and video [38].…”
Section: Visually Grounded Speech
Mentioning confidence: 99%
“…Harwath and Glass collected spoken captions for the Flickr8k database and used it to train the first neural network-based VGS model [26]. Since then, there have been many improvements to the model architecture ([27][28][29][30][31][32][33]), as well as new applications of VGS models such as semantic keyword spotting ([14,34,35]), image generation [36], recovering of masked speech [37], and even the combination of speech and video [38].…”
Section: Visually Grounded Speech
Mentioning confidence: 99%
“…There is some work that presents a Bayesian probabilistic formulation to learn referential grounding in dialog (Liu et al., 2014), user preferences (Cadilhac et al., 2013), and color descriptions (McMahan and Stone, 2015; Andreas and Klein, 2014). A large body of work also focuses on leveraging attention mechanisms for grounding multimodal phenomena in images (Srinivasan et al., 2020; Chu et al., 2018; Fan et al., 2019; Vu et al., 2018; Kawakami et al., 2019; Dong et al., 2019), videos (Lei et al., 2020), and navigation of embodied agents (Yang et al., 2020). Some approach this using data structures such as graphs in the domains of grounding images (Chang et al., 2015; Liu et al., 2014), videos, text (Laws et al., 2010; Chen, 2012; Massé et al., 2008), entities (Zhou et al., 2018a), knowledge graphs and ontologies (Jauhar et al., 2015; Zhang et al., 2020), and interactive settings (Jauhar et al., 2015; Xu et al., 2020).…”
Section: Stratification
Mentioning confidence: 99%
“…• Non-Textual Modality: In the visual modality, weak supervision is used in the context of automatic object proposals for different tasks like spoken image captioning (Srinivasan et al., 2020), visual semantic role labeling (Silberer and Pinkal, 2018), phrase grounding…”
Section: Approaches To Grounding
Mentioning confidence: 99%