Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence 2018
DOI: 10.24963/ijcai.2018/155
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding

Abstract: Visual grounding aims to localize an object in an image referred to by a textual query phrase. Various visual grounding approaches have been proposed, and the problem can be modularized into a general framework: proposal generation, multimodal feature representation, and proposal ranking. Of these three modules, most existing approaches focus on the latter two, with the importance of proposal generation generally neglected. In this paper, we rethink the problem of what properties make a good proposal generator…

Cited by 116 publications (68 citation statements)
References 1 publication
“…They model the conditional probability P(o|r), where r is the referent and o is the appropriate visual object. Instead of modeling P(o|r) directly, others [5], [16], [20], [21], [22], [23], [24] compute P(r|o) by using the CNN-LSTM structure for language generation. The visual region o maximizing P(r|o) is considered to be the target region.…”
Section: Grounding Natural Language
confidence: 99%
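The selection rule described in the excerpt above can be sketched in a few lines. This is a hypothetical illustration, not code from any cited paper: the scores stand in for log-likelihoods log P(r|o) that a CNN-LSTM language model would assign to the expression r given each candidate region o.

```python
# Hypothetical sketch of the ranking rule above: given a referring expression r
# and candidate regions o_1..o_n, each region is scored by the language-model
# likelihood P(r | o), and the highest-scoring region is chosen as the target.
# The scores below are placeholder numbers, not real model outputs.

def ground_expression(region_scores):
    """Return the index of the candidate region maximizing P(r | o)."""
    best_idx, best_score = None, float("-inf")
    for idx, score in enumerate(region_scores):
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx

# Example: log P(r | o) for three candidate regions.
scores = [-4.2, -1.3, -2.7]
print(ground_expression(scores))  # → 1
```

In practice the per-region scores come from running the language-generation model once per proposal, so the cost grows linearly with the number of proposals — one reason proposal quality matters.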
“…Each object feature z ∈ Z is then linearly projected into a ranking score s ∈ ℝ and a 4-D bounding-box offset b ∈ ℝ⁴, respectively. Similar to [59], we design a multi-task loss function consisting of a ranking loss L_rank and a regression loss L_reg:…”
Section: Task-specific Heads
confidence: 99%
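A minimal sketch of the two-term loss in the excerpt above, under stated assumptions: the excerpt does not give the exact losses, so here L_rank is taken to be a softmax cross-entropy over the candidate scores and L_reg a smooth-L1 loss on the 4-D box offsets — common choices for such heads, but assumptions, not the cited paper's definitions.

```python
import math

def ranking_loss(scores, target_idx):
    """Assumed L_rank: softmax cross-entropy over candidate ranking scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target_idx]

def smooth_l1(pred_box, gt_box):
    """Assumed L_reg: smooth-L1 loss summed over the 4-D box offsets."""
    total = 0.0
    for p, g in zip(pred_box, gt_box):
        d = abs(p - g)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def multitask_loss(scores, target_idx, pred_box, gt_box, reg_weight=1.0):
    """Combine the two terms: L = L_rank + reg_weight * L_reg."""
    return ranking_loss(scores, target_idx) + reg_weight * smooth_l1(pred_box, gt_box)
```

The relative weight `reg_weight` is a hypothetical knob; papers with such heads typically tune this trade-off on a validation set.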
“…In terms of representations of image regions and natural language referring expressions, existing approaches for referring expression comprehension can be generalized into two categories: (1) visual-representation un-enriched models, which directly extract deep features from a pretrained CNN as the visual representations of detected image regions (Mao et al., 2016; Yu et al., 2016, 2017; Hu et al., 2017; Deng et al., 2018; Zhang et al., 2018; Zhuang et al., 2018); (2) visual-representation enriched models, which enhance the visual representations by adding external visual information for regions (Liu et al., 2017; Yu et al., 2018a,b). Liu et al. (2017) leveraged external knowledge acquired by an attribute-learning model to enrich the information of regions.…”
Section: Related Work
confidence: 99%