2019
DOI: 10.1109/tpami.2017.2786699

Disambiguating Visual Verbs

Abstract: In this article, we introduce a new task, visual sense disambiguation for verbs: given an image and a verb, assign the correct sense of the verb, i.e., the one that describes the action depicted in the image. Just as textual word sense disambiguation is useful for a wide range of NLP tasks, visual sense disambiguation can be useful for multimodal tasks such as image retrieval, image description, and text illustration. We introduce a new dataset, which we call VerSe (short for Verb Sense) that augments existing…

citations
Cited by 10 publications
(44 citation statements)
references
References 50 publications
(63 reference statements)
“…In the present paper, we show that the heatmaps generated by the verb prediction model of Gella et al (2018) correlate well with heatmaps obtained from human observers performing a verb classification task. We achieve a higher correlation than a range of baselines (center bias, visual salience, and model combinations), indicating that the verb prediction model successfully identifies those image regions that are indicative of the verb depicted in the image.…”
Section: Introduction (supporting)
confidence: 63%
“…Verb Prediction Model (M) In our study, we used the verb prediction model proposed by Gella et al (2018), which employs a multilabel CNN-based classification approach and is designed to simultaneously predict all verbs associated with an image. This model is trained over a vocabulary that consists of the 250 most common verbs in the TUHOI, Flickr30k, and COCO image description datasets.…”
Section: Fixation Prediction Models (mentioning)
confidence: 99%
“…However, such a theory has neither been formalized nor evaluated to predicting verb frame extensions through time. Separately, computational work in multimodal semantics has suggested how word meanings warrant a richer representation beyond purely linguistic knowledge (e.g., Bruni et al 2012; Gella et al 2016, 2017). However, multimodal semantic representations have neither been examined in the diachronics of compositionality nor in light of the cognitive theories of chaining.…”
Section: Introduction (mentioning)
confidence: 99%
“…Word sense disambiguation is typically tackled using only textual context; however, in a multimodal setting, visual context is also available and can be used for disambiguation. Most prior work on visual word sense disambiguation has targeted noun senses (Barnard and Johnson, 2005; Loeff et al, 2006; Saenko and Darrell, 2008), but the task has recently been extended to verb senses (Gella et al, 2016, 2019). Resolving sense ambiguity is particularly crucial for translation tasks, as words can have more than one translation, and these translations often correspond to word senses (Carpuat and Wu, 2007; Navigli, 2009).…”
Section: Introduction (mentioning)
confidence: 99%