2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/CVPR.2019.00850
Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions

Abstract: Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the exterior. As an image can be described in infinite ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control …

Cited by 191 publications (162 citation statements)
References 47 publications
“…For example, we could learn a ranker to help us choose the best contents (Stent et al, 2004). Or we could manually define some matching rules to help rank the selection (Cornia et al, 2018). In Table 2, we show the VRS model achieves very high metric scores based on an oracle ranker, so learning a ranker should be able to improve the performance straightforwardly.…”
Section: A Performance/Controllability Trade-off
confidence: 99%
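The oracle-ranker idea quoted above (pick the candidate that scores highest against the references) can be sketched as follows. This is a minimal illustration, not the cited authors' implementation; `overlap_f1` is a hypothetical stand-in for a real captioning metric such as CIDEr, and a learned ranker would replace `score_fn` with a model that never sees the references:

```python
def oracle_rank(candidates, references, score_fn):
    """Return the candidate caption that a reference-aware scorer likes best.

    An oracle ranker peeks at the references, giving an upper bound on what
    a learned ranker could achieve with the same candidate pool.
    """
    return max(candidates, key=lambda c: score_fn(c, references))


def overlap_f1(candidate, references):
    """Toy word-overlap F1 against the best-matching reference.

    A hypothetical stand-in for a proper metric (CIDEr, METEOR, ...).
    """
    cand = set(candidate.split())
    best = 0.0
    for ref in references:
        r = set(ref.split())
        inter = len(cand & r)
        if inter == 0:
            continue
        p, rec = inter / len(cand), inter / len(r)
        best = max(best, 2 * p * rec / (p + rec))
    return best


# Example: the overlapping candidate wins under the oracle.
# oracle_rank(["a dog on grass", "a cat"],
#             ["a dog running on grass"], overlap_f1) -> "a dog on grass"
```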
“…Lu et al [15] first generate a sentence template with blank slots which will be filled in by visual concepts using object detectors. Cornia et al [16] propose a controllable approach to shift the rank of salient image regions by shift gate with adaptive attention. Li et al [17] introduce a new architecture to facilitate vocabulary expansion and produce novel objects via pointing mechanism and object learners.…”
Section: Related Work
confidence: 99%
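The template-filling scheme attributed to Lu et al. in the excerpt above (generate a sentence template with blank slots, then fill the slots with detector outputs) can be sketched roughly as below. This is an assumed simplification: the `<slot>` token and the in-order filling policy are illustrative choices, not the published method, which fills slots with learned visual-concept classifiers:

```python
def fill_template(template, detections):
    """Fill <slot> placeholders in a caption template with detected
    visual concepts, consumed in order.

    Returns the completed caption, or None if there are fewer
    detections than slots.
    """
    words = []
    it = iter(detections)
    for tok in template.split():
        if tok == "<slot>":
            try:
                words.append(next(it))
            except StopIteration:
                return None  # template demands more concepts than detected
        else:
            words.append(tok)
    return " ".join(words)


# Example with hypothetical detector outputs:
# fill_template("a <slot> sitting on a <slot>", ["cat", "chair"])
# -> "a cat sitting on a chair"
```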
“…The authors declare no conflict of interest. [flattened dataset/metric comparison table omitted]…”
Section: Conflicts of Interest
confidence: 99%