Seeing What You're Told: Sentence-Guided Activity Recognition in Video

Narayanaswamy, Siddharth; Barbu, Andrei; Siskind, Jeffrey Mark

doi:10.1109/cvpr.2014.99

Cited by 23 publications

(37 citation statements)

References 11 publications


“…Our model extends the approach of Siddharth et al (2014) in several ways. First, we depart from the dependency based representation used in that work, and recast the model to encode first order logic formulas.…”

Section: Anaphora (mentioning)

confidence: 94%

“…Previous language and vision studies focused on the development of multimodal word and sentence representations (Bruni et al, 2012; Socher et al, 2013; Silberer and Lapata, 2014; Gong et al, 2014; Lazaridou et al, 2015), as well as methods for describing images and videos in natural language (Farhadi et al, 2010; Kulkarni et al, 2011; Mitchell et al, 2012; Socher et al, 2014; Thomason et al, 2014; Karpathy and Fei-Fei, 2014; Siddharth et al, 2014; Venugopalan et al, 2015; Vinyals et al, 2015). While these studies handle important challenges in multimodal processing of language and vision, they do not provide explicit modeling of linguistic ambiguities.…”

Section: Related Work (mentioning)

confidence: 99%

“…The closest dataset is that of Siddharth et al (2014) as it controls for object appearance, color, action, and direction of motion, making it more likely to be suitable for evaluating disambiguation tasks. Unfortunately, that dataset was designed to avoid ambiguities, and therefore is not suitable for evaluating the work described here.…”

Section: Semantics (mentioning)

confidence: 99%

“…To perform the disambiguation task, we extend the sentence recognition model of Siddharth et al (2014) which represents sentences as compositions of words. Given a sentence, its first order logic interpretation and a video, our model produces a score which determines if the sentence is depicted by the video.…”

Section: Model (mentioning)

confidence: 99%

“…Our approach for tackling this task extends the sentence tracker introduced in (Siddharth et al, 2014). The sentence tracker produces a score which determines if a sentence is depicted by a video.…”

Section: Introduction (mentioning)

confidence: 99%


Do You See What I Mean? Visual Resolution of Linguistic Ambiguities

Berzak, Barbu, Harari et al. 2015

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Self Cite


Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic and discourse ambiguities, coupled with videos that visualize the different interpretations for each sentence. We address this task by extending a vision model which determines if a sentence is depicted by a video. We demonstrate how such a model can be adjusted to recognize different interpretations of the same underlying sentence, allowing to disambiguate sentences in a unified fashion across the different ambiguity types.


Coherent Multi-sentence Video Description with Variable Level of Detail

Rohrbach, Qiu et al. 2014

Lecture Notes in Computer Science



Collecting and annotating the large continuous action dataset

Barrett et al. 2016

Machine Vision and Applications

Self Cite


We make available to the community a new dataset to support action recognition research. This dataset is different from prior datasets in several key ways. It is significantly larger. It contains streaming video with long segments containing multiple action occurrences that often overlap in space and/or time. All actions were filmed in the same collection of backgrounds so that background gives little clue as to action class. We had five humans to replicate the annotation of temporal extent of action occurrences labeled with their classes and measured a surprisingly low level of intercoder agreement. Baseline experiments show that recent state-of-the-art methods perform poorly on this dataset. This suggests that this will be a challenging dataset to foster advances in action recognition research. This manuscript serves to describe the novel content and characteristics of the LCA dataset, present the design decisions made when filming the dataset, document the novel methods employed to annotate the dataset, and present the results of our baseline experiments.
