Movie Description

Rohrbach, Anna; Torabi, Atousa; Rohrbach, Marcus; Tandon, Niket; Pal, Christopher; Larochelle, Hugo; Courville, Aaron; Schiele, Bernt

doi:10.1007/s11263-016-0987-1

Cited by 274 publications

(235 citation statements)

References 79 publications

“…The images in VCR are extracted from video clips from LSMDC [67] and MovieClips. These clips vary in length from a few seconds (LSMDC) to several minutes (MovieClips).…”

Section: B1 Shot Detection Pipeline (mentioning)

confidence: 99%

From Recognition to Cognition: Visual Commonsense Reasoning

Zellers, Bisk, Farhadi, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

[Figure residue from the citing paper: example multiple-choice questions about people in movie scenes, each with four candidate answers and four candidate rationales. Question types: Explanation 38%, Activity 24%, Temporal 13%, Mental 8%, Role 7%, Scene 5%, Hypothetical 5%.]

“…Human Evaluation. Automatic metrics for evaluating generated sentences have frequently shown to be unreliable and not consistent with human judgments, especially for video description when there is only a single reference [28]. Hence, we conducted a human evaluation to evaluate the sentence quality on the test set of ActivityNet-Entities.…”

Section: Video Event Description (mentioning)

confidence: 99%

“…because they might have appeared in similar contexts during training. This makes models less accountable and trustworthy, which is important if we hope such models will eventually assist people in need [2,28]. Additionally, grounded models can help to explain the model's decisions to humans and allow humans to diagnose them [21].…”

Section: Introduction (mentioning)

confidence: 99%

Grounded Video Description

Zhou, Kalantidis, Chen, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

Video description is one of the most challenging problems in vision and language understanding due to the large variability both on the video and language side. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of a video. Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data and, importantly, evaluating how grounded or "true" such models are to the video they describe. To generate grounded captions, we propose a novel video description model which is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on our dataset, but also show how it can be applied to image description on the Flickr30k Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description and demonstrate our generated sentences are better grounded in the video.

“…For example, in video question-answering [35], most questions center around the characters asking who they are, what they do, and even why they act in certain ways. The related task of video captioning [25] often uses a character agnostic way (replacing names by someone) making the captions very artificial and uninformative (e.g. someone opens the door).…”

Section: Introduction (mentioning)

confidence: 99%

Self-Supervised Learning of Face Representations for Video Face Clustering

Sharma, Tapaswi, Sarfraz, et al. 2019

2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019)

Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. In this paper, we address video face clustering using unsupervised methods. Our emphasis is on distilling the essential information, identity, from the representations obtained using deep pre-trained face networks. We propose a self-supervised Siamese network that can be trained without the need for video/track based supervision, and thus can also be applied to image collections. We evaluate our proposed method on three video face clustering datasets. The experiments show that our methods outperform current state-of-the-art methods on all datasets. Video face clustering is lacking a common benchmark as current works are often evaluated with different metrics and/or different sets of face tracks. The datasets and code are available at https://github.com/vivoutlaw/SSIAM.
