Generating Descriptions with Grounded and Co-referenced People

Rohrbach, Anna; Rohrbach, Marcus; Tang, Siyu; Oh, Seong Joon; Schiele, Bernt

doi:10.1109/cvpr.2017.447

Cited by 49 publications

(35 citation statements)

References 56 publications

“…Given an image/video and a language query, image/video grounding aims to localize a spatial region in the image (Plummer et al, 2015; Yu et al, 2017, 2018) or a specific frame in the video (Zhou et al, 2018) which semantically corresponds to the language query. Grounding has broad applications, such as text based image retrieval (Chen et al, 2017;, description generation (Wang et al, 2018a; Rohrbach et al, 2017; A brown and white dog is lying on the grass and then it stands up.…”

Section: Introduction (mentioning)

confidence: 99%

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

Chen, Ma, Luo, et al. 2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. First, a set of spatio-temporal tubes, referred to as instances, are extracted from the video. We then encode these instances and the sentence using our proposed attentive interactor which can exploit their fine-grained relationships to characterize their matching behaviors. Besides a ranking loss, a novel diversity loss is introduced to train the proposed attentive interactor to strengthen the matching behaviors of reliable instance-sentence pairs and penalize the unreliable ones. Moreover, we also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Extensive experimental results demonstrate the superiority of our model over the baseline approaches. Our code and the constructed VID-sentence dataset

“…Visual attention usually comes in the form of temporal attention [35] (or spatial-attention [33] in the image domain), semantic attention [14,36,37,42] or both [20]. The recent unprecedented success in object detection [24,7] has regained the community's interests on detecting fine-grained visual clues while incorporating them into end-to-end networks [17,27,1,16]. Description methods which are based on object detectors [17,39,1,16,5,13] tackle the captioning problem in two stages.…”

Section: Related Work (mentioning)

confidence: 99%

“…Instead of fine-tuning a general detector, we transfer the object classification knowledge from off-the-shelf object detectors to our model and then fine-tune this representation as part of our generation model with sparse box annotations. With a focus on co-reference resolution and identifying people, [27] proposes a framework that can refer to particular character instances and do visual co-reference resolution between video clips. However, their method is restricted to identifying human characters whereas we study more general the grounding of objects.…”

Section: Related Work (mentioning)

confidence: 99%

Grounded Video Description

Zhou, Kalantidis, Chen, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

Video description is one of the most challenging problems in vision and language understanding due to the large variability both on the video and language side. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of a video. Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data, and importantly, evaluating how grounded or "true" such models are to the video they describe. To generate grounded captions, we propose a novel video description model which is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on our dataset, but also show how it can be applied to image description on the Flickr30k Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description and demonstrate our generated sentences are better grounded in the video.

“…someone opens the door). However, recent work [24] suggests that more meaningful captions can be achieved from an improved understanding of characters. In general, the ability to predict which characters appear when and where facilitates a deeper video understanding that is grounded in the storyline.…”

Section: Introduction (mentioning)

confidence: 99%

Self-Supervised Learning of Face Representations for Video Face Clustering

Sharma, Tapaswi, Sarfraz, et al. 2019

2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019)

Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. In this paper, we address video face clustering using unsupervised methods. Our emphasis is on distilling the essential information, identity, from the representations obtained using deep pre-trained face networks. We propose a self-supervised Siamese network that can be trained without the need for video/track based supervision, and thus can also be applied to image collections. We evaluate our proposed method on three video face clustering datasets. The experiments show that our methods outperform current state-of-the-art methods on all datasets. Video face clustering is lacking a common benchmark as current works are often evaluated with different metrics and/or different sets of face tracks. The datasets and code are available at https://github.com/vivoutlaw/SSIAM.
