2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/ICCV.2017.143
Recurrent Multimodal Interaction for Referring Image Segmentation

Abstract: In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e. referring expressions. Existing works tackle this problem by first modeling images and sentences independently and then segmenting images by combining these two types of representations. We argue that learning word-to-image interaction is more natural in the sense of jointly modeling the two modalities for the image segmentation task, and we propose a convolutional multimodal LSTM to encode the sequential inter…
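To make the idea of a convolutional multimodal LSTM concrete, here is a minimal NumPy sketch, not the paper's implementation: at each word step, the word embedding is tiled over the spatial grid, concatenated with the image feature map and the hidden state, and passed through per-pixel (1×1-kernel) LSTM gates so the segmentation hidden state is refined word by word. All dimensions, the 1×1 kernel, and the final mask projection are illustrative assumptions.

```python
import numpy as np

def conv_mlstm_step(word_emb, img_feat, h, c, W, b):
    """One step of a convolutional multimodal LSTM (1x1-kernel sketch).

    word_emb: (Dw,)        embedding of the current word
    img_feat: (H, Wd, Di)  image feature map, fixed across steps
    h, c:     (H, Wd, Dh)  hidden / cell feature maps
    W:        (Dw+Di+Dh, 4*Dh) gate weights (a 1x1 conv == per-pixel linear)
    b:        (4*Dh,)      gate bias
    """
    H, Wd, _ = img_feat.shape
    # Tile the word embedding over every spatial position.
    word_tile = np.broadcast_to(word_emb, (H, Wd, word_emb.shape[0]))
    # Joint word-image-state input, then all four gates in one matmul.
    z = np.concatenate([word_tile, img_feat, h], axis=-1) @ W + b
    i, f, o, g = np.split(z, 4, axis=-1)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c_new = sig(f) * c + sig(i) * np.tanh(g)
    h_new = sig(o) * np.tanh(c_new)
    return h_new, c_new

# Toy usage: a 3-word expression over a 5x5 feature map.
rng = np.random.default_rng(0)
Dw, Di, Dh, H, Wd = 8, 16, 12, 5, 5
img = rng.standard_normal((H, Wd, Di))
h = np.zeros((H, Wd, Dh))
c = np.zeros((H, Wd, Dh))
Wg = 0.1 * rng.standard_normal((Dw + Di + Dh, 4 * Dh))
b = np.zeros(4 * Dh)
for _ in range(3):  # one LSTM step per word
    h, c = conv_mlstm_step(rng.standard_normal(Dw), img, h, c, Wg, b)
# A 1x1 projection of the final hidden map gives per-pixel mask logits.
mask_logits = h @ rng.standard_normal(Dh)
```

The key design point this sketch illustrates is that the recurrence runs over words while the state stays spatial, so every word can interact with every image location before the mask is predicted.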

Cited by 217 publications (175 citation statements)
References 27 publications
“…IoU RMI-LSTM [17] 42 [15] are slightly higher than the original numbers reported in their paper, which did not use DenseCRF post-processing.…”
Section: Ablation Study (citation type: mentioning)
Confidence: 58%
“…They are then concatenated together for spatial mask prediction. To better achieve word-to-image interaction, [17] directly combines visual features with each word feature from a language LSTM to recurrently refine segmentation results. Dynamic filter [20] for each word further enhances this interaction.…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
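The "dynamic filter for each word" mentioned in the quote above can be sketched in a few lines of NumPy. This is an illustrative assumption about the mechanism, not the cited paper's code: each word embedding is mapped to a 1×1 filter over the image's channel dimension, and correlating that filter with the feature map yields a per-word spatial response.

```python
import numpy as np

def word_filter_response(word_emb, img_feat, Wf):
    """Per-word dynamic filter (1x1 sketch).

    word_emb: (Dw,)        word embedding
    img_feat: (H, Wd, Di)  image feature map
    Wf:       (Dw, Di)     maps a word to a filter over image channels
    """
    filt = word_emb @ Wf   # (Di,) channel filter generated from the word
    return img_feat @ filt # (H, Wd) response map: filter dotted at each pixel

# Toy usage with assumed dimensions.
rng = np.random.default_rng(1)
img = rng.standard_normal((4, 4, 16))   # H x W x Di
Wf = rng.standard_normal((8, 16))       # word dim -> image channel dim
resp = word_filter_response(rng.standard_normal(8), img, Wf)
```

Because the filter is generated from the word rather than learned as a fixed kernel, each word in the expression probes the image differently, which is what strengthens the word-to-image interaction.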
“…We propose a simple, fast, and accurate one-stage approach to visual grounding, which aims to ground a natural language query (phrase or sentence) about an image onto a correct region of the image. By defining visual grounding at this level, we deliberately abstract away the subtle distinctions between phrase localization [30,42], referring expression comprehension [15,24,48,47,22], natural language object retrieval [14,16], visual question segmentation [9,13,20,25], etc., each of which can be seen as a variation of the general visual grounding problem. We benchmark our one-stage approach for both phrase localization and referring expression comprehension.…”
Section: Introduction (citation type: mentioning)
Confidence: 99%