Fine-Grained Grounding for Multimodal Speech Recognition

Srinivasan, Tejas; Sanabria, Ramon; Metze, Florian; Elliott, Desmond

doi:10.18653/v1/2020.findings-emnlp.242

Cited by 7 publications

(4 citation statements)

References 43 publications


“…Harwath and Glass collected spoken captions for the Flickr8k database and used it to train the first neural network based VGS model [26]. There have been many improvements to the model architecture ([27,28,29,30,31,32,33]) and new applications of VGS models such as semantic keyword spotting ([34,35,14]), image generation [36], recovering of masked speech [37] and even models combining speech and video [38].…”

Section: Visually Grounded Speech (mentioning)

confidence: 99%

Modelling word learning and recognition using visually grounded speech

Merkx, Scholten, Frank et al. 2022

Preprint


Background: Computational models of speech recognition often assume that the set of target words is already given. This implies that these models do not learn to recognise speech from scratch without prior knowledge and explicit supervision. Visually grounded speech models learn to recognise speech without prior knowledge by exploiting statistical dependencies between spoken and visual input. While it has previously been shown that visually grounded speech models learn to recognise the presence of words in the input, we explicitly investigate such a model as a model of human speech recognition. Methods: We investigate the time-course of word recognition as simulated by the model using a gating paradigm to test whether its recognition is affected by well-known word-competition effects in human speech processing. We furthermore investigate whether vector quantisation, a technique for discrete representation


“…Harwath and Glass collected spoken captions for the Flickr8k database and used it to train the first neural network-based VGS model [26]. Since then, there have been many improvements to the model architecture ([27][28][29][30][31][32][33]), as well as new applications of VGS models such as semantic keyword spotting ([14,34,35]), image generation [36], recovering of masked speech [37], and even the combination of speech and video [38].…”

Section: Visually Grounded Speech (mentioning)

confidence: 99%

Modelling Human Word Learning and Recognition Using Visually Grounded Speech

et al. 2022


Many computational models of speech recognition assume that the set of target words is already given. This implies that these models learn to recognise speech in a biologically unrealistic manner, i.e. with prior lexical knowledge and explicit supervision. In contrast, visually grounded speech models learn to recognise speech without prior lexical knowledge by exploiting statistical dependencies between spoken and visual input. While it has previously been shown that visually grounded speech models learn to recognise the presence of words in the input, we explicitly investigate such a model as a model of human speech recognition. We investigate the time course of noun and verb recognition as simulated by the model using a gating paradigm to test whether its recognition is affected by well-known word competition effects in human speech processing. We furthermore investigate whether vector quantisation, a technique for discrete representation learning, aids the model in the discovery and recognition of words. Our experiments show that the model is able to recognise nouns in isolation and even learns to properly differentiate between plural and singular nouns. We also find that recognition is influenced by word competition from the word-initial cohort and neighbourhood density, mirroring word competition effects in human speech comprehension. Lastly, we find no evidence that vector quantisation is helpful in discovering and recognising words, though our gating experiment does show that the LSTM-VQ model is able to recognise the target words earlier.


“…There is some work that presents a Bayesian probabilistic formulation to learn referential grounding in dialog (Liu et al, 2014), user preferences (Cadilhac et al, 2013), color descriptions (McMahan and Stone, 2015; Andreas and Klein, 2014). A huge chunk of work also focus on leveraging attention mechanism for grounding multimodal phenomenon in images (Srinivasan et al, 2020; Chu et al, 2018; Fan et al, 2019; Vu et al, 2018; Kawakami et al, 2019; Dong et al, 2019), videos (Lei et al, 2020; and navigation of embodied agents (Yang et al, 2020), etc. Some approach this using data structures such as graphs in the domains of grounding images (Chang et al, 2015; Liu et al, 2014), videos ), text (Laws et al, 2010; Chen, 2012; Massé et al, 2008), entities (Zhou et al, 2018a), knowledge graphs and ontologies (Jauhar et al, 2015; Zhang et al, 2020) and interactive settings Jauhar et al (2015); Xu et al (2020).…”

Section: Stratification (mentioning)

confidence: 99%

“…• Non-Textual Modality: In the visual modality, weak supervision is used in the contexts of automatic object proposals for different tasks like spoken image captioning (Srinivasan et al, 2020), visual semantic role labeling (Silberer and Pinkal, 2018), phrase grounding…”

Section: Approaches To Grounding (mentioning)

confidence: 99%

Grounding ‘Grounding’ in NLP

Chandu, Bisk, Black 2021

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021


The NLP community has seen substantial recent interest in grounding to facilitate interaction between language technologies and the world. However, as a community, we use the term broadly to reference any linking of text to data or non-textual modality. In contrast, Cognitive Science more formally defines "grounding" as the process of establishing what mutual information is required for successful communication between two interlocutors, a definition which might implicitly capture the NLP usage but differs in intent and scope. We investigate the gap between these definitions and seek answers to the following questions: (1) What aspects of grounding are missing from NLP tasks? Here we present the dimensions of coordination, purviews and constraints. (2) How is the term "grounding" used in the current research? We study the trends in datasets, domains, and tasks introduced in recent NLP conferences. And finally, (3) How to advance our current definition to bridge the gap with Cognitive Science? We present ways to both create new tasks or repurpose existing ones to make advancements towards achieving a more complete sense of grounding.

