2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00205

Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing

Abstract: Referring expression grounding aims at locating certain objects or persons in an image with a referring expression, where the key challenge is to comprehend and align various types of information from the visual and textual domains, such as visual attributes, location, and interactions with surrounding regions. Although the attention mechanism has been successfully applied for cross-modal alignment, previous attention models focus on only the most dominant features of both modalities, and neglect the fact that there…
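To make the erasing idea in the title concrete, here is a minimal sketch of attention-guided erasing on the visual side, assuming pre-extracted region features and a pooled expression embedding. All names, shapes, and the single-region erasing policy are illustrative assumptions, not the authors' exact formulation, which is cross-modal and more elaborate.

```python
# Minimal sketch of attention-guided erasing on the visual side.
# Assumptions: region features and a pooled expression embedding are
# already extracted; dot-product attention stands in for the model's
# actual cross-modal attention.
import torch
import torch.nn.functional as F

def attend(query, regions):
    """Attention of an expression embedding over region features.

    query:   (d,)   pooled referring-expression embedding
    regions: (n, d) visual region features
    returns: (n,)   attention weights over the regions
    """
    scores = regions @ query
    return F.softmax(scores, dim=0)

def erase_most_attended(regions, attn):
    """Zero out the most-attended region feature, producing a harder
    training view that forces the model to ground the expression on
    the remaining, less dominant evidence."""
    erased = regions.clone()
    erased[attn.argmax()] = 0.0
    return erased

# Usage: pair each original sample with its erased counterpart.
query = torch.randn(256)
regions = torch.randn(10, 256)
erased_regions = erase_most_attended(regions, attend(query, regions))
```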

Cited by 167 publications (99 citation statements) · References 42 publications
“…In future work we plan to replicate the results reported by Krishna et al. (2018) and to compare them with our object-only baseline. We hope to do the same for other published results on referring relationships using the VRD dataset, among other datasets (Cirik et al., 2018a; Liu et al., 2019; Raboh et al., 2019).…”
Section: Revisiting "Referring Relationship" Grounding (mentioning)
Confidence: 55%
“…The first experiment compared the performance of our method and several baseline models on the matching detection task. Based on CMNs [23], MAttNet [24], and CM-Att [46], we propose baseline models for matching detection called CMNs-baseline, MAttNet-baseline, and CM-Att-baseline. These baseline models decide whether the expression matches the image by applying a threshold to the highest predicted score.…”
Section: Methods (mentioning)
Confidence: 99%
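The thresholding rule described in the excerpt above can be sketched in a few lines. The function name and the default threshold value are hypothetical placeholders; the cited work tunes its own threshold.

```python
def expression_matches(region_scores, threshold=0.5):
    """Matching detection by thresholding the highest score.

    region_scores: per-region matching scores for one expression, as
    produced by a grounding model such as MAttNet. The expression is
    declared to match the image only if the best score clears the
    threshold. The 0.5 default is a hypothetical placeholder.
    """
    return max(region_scores) >= threshold

# Usage:
print(expression_matches([0.12, 0.81, 0.33]))  # True with threshold 0.5
```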
“…Many tasks [1,2,3,4,5,6] have been explored to link vision and language. Among these tasks, image-text matching [1,3] aims to learn the semantic similarities between images and sentences, which can be applied to bi-directional image and text retrieval: retrieving images given sentences, or retrieving sentences for a query image.…”
Section: Introduction (mentioning)
Confidence: 99%
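The bi-directional retrieval described in the excerpt above reduces to ranking by similarity in a shared embedding space. A minimal sketch, assuming images and sentences have already been encoded into a common space; all names and shapes here are hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve(image_emb, text_emb, k=5):
    """Bi-directional retrieval by cosine similarity.

    image_emb: (n_img, d) image embeddings in a shared space
    text_emb:  (n_txt, d) sentence embeddings in the same space
    returns:   top-k sentence indices per image, shape (n_img, k),
               and top-k image indices per sentence, shape (n_txt, k)
    """
    img = F.normalize(image_emb, dim=1)
    txt = F.normalize(text_emb, dim=1)
    sim = img @ txt.t()                      # (n_img, n_txt) cosine scores
    texts_for_images = sim.topk(k, dim=1).indices
    images_for_texts = sim.topk(k, dim=0).indices.t()
    return texts_for_images, images_for_texts

# Usage with random stand-in embeddings:
t_for_i, i_for_t = retrieve(torch.randn(100, 256), torch.randn(500, 256))
```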