2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.447
Generating Descriptions with Grounded and Co-referenced People

Abstract: Learning how to generate descriptions of images or videos has received major interest in both the Computer Vision and Natural Language Processing communities. While a few works have proposed to learn a grounding during the generation process in an unsupervised way (via an attention mechanism), it remains unclear how good the quality of the grounding is and whether it benefits the description quality. In this work we propose a movie description model which learns to generate descriptions and jointly ground (localize…
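The abstract's central idea, generating a description while jointly grounding (localizing) the people it mentions, can be illustrated with a small attention-based decoder. The sketch below is an illustrative assumption, not the authors' implementation: the module names, feature dimensions, and the choice of an LSTM with additive attention over person-region features are all placeholders. The attention weights produced at each decoding step act as a soft grounding of the generated word onto regions.

```python
# Hedged sketch: caption decoder whose per-step attention over person/region
# features doubles as a soft grounding. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedDecoder(nn.Module):
    def __init__(self, vocab_size, region_dim=2048, hidden_dim=512, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + region_dim, hidden_dim)
        self.att_region = nn.Linear(region_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, tokens):
        # regions: (B, R, region_dim) person/region features from a detector
        # tokens:  (B, T) caption tokens used for teacher forcing
        B = regions.size(0)
        h = regions.new_zeros(B, self.lstm.hidden_size)
        c = regions.new_zeros(B, self.lstm.hidden_size)
        logits, grounding = [], []
        for t in range(tokens.size(1)):
            # additive attention over regions, conditioned on the LSTM state
            scores = self.att_score(torch.tanh(
                self.att_region(regions) + self.att_hidden(h).unsqueeze(1))).squeeze(-1)
            alpha = F.softmax(scores, dim=1)              # (B, R) soft grounding
            context = (alpha.unsqueeze(-1) * regions).sum(1)
            h, c = self.lstm(torch.cat([self.embed(tokens[:, t]), context], 1), (h, c))
            logits.append(self.out(h))
            grounding.append(alpha)
        return torch.stack(logits, 1), torch.stack(grounding, 1)

decoder = GroundedDecoder(vocab_size=1000)
logits, ground = decoder(torch.randn(2, 5, 2048), torch.randint(0, 1000, (2, 8)))
```

With sparse box annotations, such attention weights could additionally be supervised (e.g. with a cross-entropy loss against the annotated region), which is in the spirit of learning the grounding jointly with generation rather than purely via unsupervised attention.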

Cited by 49 publications (35 citation statements)
References 56 publications
“…Given an image/video and a language query, image/video grounding aims to localize a spatial region in the image (Plummer et al., 2015; Yu et al., 2017, 2018) or a specific frame in the video (Zhou et al., 2018) which semantically corresponds to the language query. Grounding has broad applications, such as text-based image retrieval (Chen et al., 2017) and description generation (Wang et al., 2018a; Rohrbach et al., 2017).…”
Section: Introduction
confidence: 99%
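The grounding task this excerpt describes, localizing the region that semantically corresponds to a language query, reduces to scoring region proposals against a query embedding. The following sketch is a hedged illustration under assumed names and feature dimensions; it is not the implementation of any of the cited works.

```python
# Hedged sketch: score R region proposals against a pooled query embedding
# in a shared space and pick the best match. Dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionQueryScorer(nn.Module):
    def __init__(self, region_dim=2048, query_dim=300, joint_dim=512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, joint_dim)
        self.proj_query = nn.Linear(query_dim, joint_dim)

    def forward(self, region_feats, query_feat):
        # region_feats: (R, region_dim) features of R proposals in one image
        # query_feat:   (query_dim,) pooled embedding of the language query
        r = F.normalize(self.proj_region(region_feats), dim=-1)
        q = F.normalize(self.proj_query(query_feat), dim=-1)
        return r @ q                          # (R,) cosine similarities

scorer = RegionQueryScorer()
scores = scorer(torch.randn(20, 2048), torch.randn(300))
best_region = scores.argmax().item()          # index of the grounded proposal
```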
“…Visual attention usually comes in the form of temporal attention [35] (or spatial attention [33] in the image domain), semantic attention [14,36,37,42], or both [20]. The recent unprecedented success in object detection [24,7] has renewed the community's interest in detecting fine-grained visual cues and incorporating them into end-to-end networks [17,27,1,16]. Description methods based on object detectors [17,39,1,16,5,13] tackle the captioning problem in two stages.…”
Section: Related Work
confidence: 99%
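The two-stage, detector-based captioning this excerpt refers to can be sketched as: (1) detect boxes with an off-the-shelf detector, then (2) pool one feature per region and hand the region features to a caption decoder. The snippet below wires this up with torchvision's Faster R-CNN and RoIAlign purely as an assumed example pipeline, not any cited paper's method; the pooled features would feed a decoder such as the grounded decoder sketched after the abstract.

```python
# Hedged sketch of a two-stage "detect then describe" pipeline.
import torch
from torchvision.models import resnet50
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.ops import roi_align

# Stage 1: an off-the-shelf detector proposes boxes in image coordinates.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
# A separate backbone gives a conv feature map for region pooling (stride 32).
backbone = resnet50(weights="DEFAULT").eval()
feat_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.rand(3, 480, 640)                        # dummy frame in [0, 1]
with torch.no_grad():
    boxes = detector([image])[0]["boxes"]              # (N, 4) detected boxes
    fmap = feat_extractor(image.unsqueeze(0))          # (1, 2048, H/32, W/32)
    # Stage 2: pool a fixed-size feature per detected region; these region
    # features would then be passed to a caption decoder.
    region_feats = roi_align(fmap, [boxes], output_size=(7, 7),
                             spatial_scale=1.0 / 32).mean(dim=(2, 3))  # (N, 2048)
```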
“…Instead of fine-tuning a general detector, we transfer the object classification knowledge from off-the-shelf object detectors to our model and then fine-tune this representation as part of our generation model with sparse box annotations. With a focus on co-reference resolution and identifying people, [27] proposes a framework that can refer to particular character instances and perform visual co-reference resolution between video clips. However, their method is restricted to identifying human characters, whereas we study the more general problem of grounding objects.…”
Section: Related Work
confidence: 99%
“…someone opens the door). However, recent work [24] suggests that more meaningful captions can be achieved through an improved understanding of characters. In general, the ability to predict which characters appear when and where facilitates a deeper video understanding that is grounded in the storyline.…”
Section: Introduction
confidence: 99%