Recurrent Multimodal Interaction for Referring Image Segmentation

Liu, Chenxi; Lin, Zhe; Shen, Xiaohui; Yang, Jimei; Lu, Xin; Yuille, Alan

doi:10.1109/iccv.2017.143

Cited by 217 publications

(175 citation statements)

References 27 publications

Supporting

Mentioning

174

Contrasting


“…IoU RMI-LSTM [17] 42 [15] are slightly higher than original numbers reported in their paper which did not use DenseCRF postprocessing.…”

Section: Ablation Study (mentioning)

confidence: 58%

“…They are then concatenated together for spatial mask prediction. To better achieve word-to-image interaction, [17] directly combines visual features with each word feature from a language LSTM to recurrently refine segmentation results. Dynamic filter [20] for each word further enhances this interaction.…”

Section: Related Work (mentioning)

confidence: 99%

“…In addition to visual features and word vectors, spatial coordinate features have also been shown to be useful for referring image segmentation [10,15,17]. Following prior works, we define an 8-D spatial coordinate feature at each spatial position using the implementation in [17]. The first 3-dimensions of the feature map encode the normalized horizontal positions.…”

Section: Multimodal Features (mentioning)

confidence: 99%
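
The 8-D spatial coordinate feature mentioned in the statement above can be illustrated with a minimal sketch. The quoted text only confirms that the first three dimensions encode normalized horizontal positions; the layout of the remaining five (three vertical positions, then 1/width and 1/height), the [-1, 1] scaling, and the function name `spatial_coords` are assumptions made for illustration, not the implementation of [17].

```python
def spatial_coords(height, width):
    """Return a height x width grid (nested lists) of 8-D coordinate features.

    Assumed layout per cell (only the first three are confirmed by the quote):
    [x_min, x_ctr, x_max, y_min, y_ctr, y_max, 1/width, 1/height]
    """
    grid = []
    for h in range(height):
        row = []
        for w in range(width):
            # Left edge, centre, and right edge of the cell, rescaled to [-1, 1].
            x_min = w / width * 2 - 1
            x_max = (w + 1) / width * 2 - 1
            x_ctr = (x_min + x_max) / 2
            # Top edge, centre, and bottom edge of the cell, rescaled to [-1, 1].
            y_min = h / height * 2 - 1
            y_max = (h + 1) / height * 2 - 1
            y_ctr = (y_min + y_max) / 2
            # Last two entries: cell size relative to the feature map (assumption).
            row.append([x_min, x_ctr, x_max, y_min, y_ctr, y_max,
                        1 / width, 1 / height])
        grid.append(row)
    return grid


feats = spatial_coords(2, 4)  # a tiny 2 x 4 feature map
```

Each cell's vector would typically be concatenated with the visual and language features at the same spatial position.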

“…Implementation details: Following previous work [15,17,22], we keep the maximum length of query expression as 20 and embed each word to a vector of C_l = 1000 dimensions. Given an input image, we resize it to 320 × 320 and use the outputs of DeepLab-101 ResNet blocks Res3, Res4, Res5 as the inputs for multimodal features.…”

Section: Datasets and Setup (mentioning)

confidence: 99%
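
As a rough illustration of the query-length limit in the setup quoted above, the sketch below truncates or pads a tokenized expression to 20 entries. Only the maximum length of 20 comes from the quoted text; the whitespace tokenizer, the right-side padding, the `PAD_ID` and `unk_id` values, and the helper name `encode_query` are hypothetical choices for this example.

```python
MAX_QUERY_LEN = 20  # maximum expression length stated in the quoted setup
PAD_ID = 0          # hypothetical padding index, not given in the source


def encode_query(expression, vocab, max_len=MAX_QUERY_LEN, unk_id=1):
    """Map a referring expression to exactly max_len word indices."""
    # Naive whitespace tokenization (assumption); unknown words map to unk_id.
    ids = [vocab.get(word, unk_id) for word in expression.lower().split()]
    # Drop anything beyond the maximum length.
    ids = ids[:max_len]
    # Pad on the right so every query has the same length (assumption).
    return ids + [PAD_ID] * (max_len - len(ids))


vocab = {"the": 2, "man": 3}  # toy vocabulary for illustration
indices = encode_query("The man in red", vocab)
```

The resulting fixed-length index list is what an embedding layer (1000-dimensional in the quoted setup) would consume.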


Cross-Modal Self-Attention Network for Referring Image Segmentation

Rochan et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

413

237


We consider the problem of referring image segmentation. Given an input image and a natural language expression, the goal is to segment the object referred to by the language expression in the image. Existing works in this area treat the language expression and the input image separately in their representations. They do not sufficiently capture long-range correlations between these two modalities. In this paper, we propose a cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features. Our model can adaptively focus on informative words in the referring expression and important regions in the input image. In addition, we propose a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image. This module controls the information flow of features at different levels. We validate the proposed approach on four evaluation datasets. Our proposed approach consistently outperforms existing state-of-the-art methods.


“…We propose a simple, fast, and accurate one-stage approach to visual grounding, which aims to ground a natural language query (phrase or sentence) about an image onto a correct region of the image. By defining visual grounding at this level, we deliberately abstract away the subtle distinctions between phrase localization [30,42], referring expression comprehension [15,24,48,47,22], natural language object retrieval [14,16], visual question segmentation [9,13,20,25], etc., each of which can be seen as a variation of the general visual grounding problem. We benchmark our one-stage approach for both phrase localization and referring expression comprehension.…”

Section: Introduction (mentioning)

confidence: 99%

A Fast and Accurate One-Stage Approach to Visual Grounding

Yang, Gong, Wang et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

323

286


We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired by the following insight. The performances of existing propose-and-rank two-stage methods are capped by the quality of the region candidates they propose in the first stage: if none of the candidates could cover the ground truth region, there is no hope in the second stage to rank the right region to the top. To avoid this caveat, we propose a one-stage model that enables end-to-end joint optimization. The main idea is as straightforward as fusing a text query's embedding into the YOLOv3 object detector, augmented by spatial features so as to account for spatial mentions in the query. Despite being simple, this one-stage approach shows great potential in terms of both accuracy and speed for both phrase localization and referring expression comprehension, according to our experiments. Given these results along with careful investigations into some popular region proposals, we advocate for visual grounding a paradigm shift from the conventional two-stage methods to the one-stage framework.
