2014 IEEE Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2014.99
Seeing What You're Told: Sentence-Guided Activity Recognition in Video

Abstract: We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, providing a medium for top-down and bottom-up integration as well as multimodal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing s…

Cited by 23 publications (37 citation statements)
References 11 publications
“…Our model extends the approach of Siddharth et al (2014) in several ways. First, we depart from the dependency based representation used in that work, and recast the model to encode first order logic formulas.…”
Section: Anaphora
confidence: 94%
“…Previous language and vision studies focused on the development of multimodal word and sentence representations (Bruni et al, 2012; Socher et al, 2013; Silberer and Lapata, 2014; Gong et al, 2014; Lazaridou et al, 2015), as well as methods for describing images and videos in natural language (Farhadi et al, 2010; Kulkarni et al, 2011; Mitchell et al, 2012; Socher et al, 2014; Thomason et al, 2014; Karpathy and Fei-Fei, 2014; Siddharth et al, 2014; Venugopalan et al, 2015; Vinyals et al, 2015). While these studies handle important challenges in multimodal processing of language and vision, they do not provide explicit modeling of linguistic ambiguities.…”
Section: Related Work
confidence: 99%