2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.327

Predicting Motivations of Actions by Leveraging Text

Abstract: Understanding human actions is a key problem in computer vision. However, recognizing actions is only the first step of understanding what a person is doing. In this paper, we introduce the problem of predicting why a person has performed an action in images. This problem has many applications in human activity understanding, such as anticipating or explaining an action. To study this problem, we introduce a new dataset of people performing actions annotated with likely motivations. However, the information in…

Cited by 44 publications (33 citation statements) | References 41 publications
“…Similar abilities are required in the task of generating non-grounded, human-like questions about an image (Mostafazadeh et al., 2016; Jain et al., 2017), and in that of asking discriminative questions over pairs of similar scenes. Related tasks are also those of predicting motivations of visually-grounded actions (Vondrick et al., 2016) or generating explanations for a given answer (Park et al., 2018; …).…”
Section: Related Work
confidence: 99%
“…In contrast, to ensure that actions contained information that was grounded in the image, participants were asked to mention at least one visible entity when writing their action (see errors and warnings in Figure 2). We randomly selected ∼3.6K images from the split by Vondrick et al. (2016) and, for each of them, we collected on average 5 intention, action tu-…”
Section: Data Collection
confidence: 99%
“…Understanding activities in a human-centric fashion encodes our particular experiences with the visual world. Understanding activities with emphasis on objects has been a particularly fruitful direction [26, 37, 9, 35, 55]. In a similar vein, some works have also tried modeling activities as transformations [59] or state changes [5].…”
Section: Related Work
confidence: 99%
“…However, for detailed video understanding, one needs to obtain descriptions that go beyond observable visual entities and use background knowledge and commonsense to reason about objects and actions. Work on inferring motivations of human actions in static images by incorporating commonsense knowledge is reflected in Pirsiavash et al. (2014); Vondrick et al. (2016). Commonsense caption generation has been approached on abstract scenes and clipart images in Vedantam et al. (2015).…”
Section: Related Work
confidence: 99%
“…A critical missing element in complex video understanding is the capability of performing commonsense inference, especially with a generative model. Existing efforts seek to find textual explanations or intentions of human activities as a classification task (Vondrick et al., 2016) or a vision-to-text alignment problem (Zhu et al., 2015).…”
Section: Introduction
confidence: 99%