Bi-Directional Relationship Inferring Network for Referring Image Segmentation

Hu, Zhiwei; Feng, Guang; Sun, Jiayu; Zhang, Lihe; Lu, Huchuan

doi:10.1109/cvpr42600.2020.00448

Cited by 137 publications

(101 citation statements)

References 28 publications

“…Finally, the cross-modal features are used to generate the final prediction masks. Unlike existing RES methods [13,14], which segment objects according to the query text, we input text in parallel with the input image to extract information. By combining crossmodal features from both image and text, we accurately segment fluorescein leakage.…”

Section: Methods (mentioning)

confidence: 99%

Let’s Find Fluorescein: Cross-Modal Dual Attention Learning For Fluorescein Leakage Segmentation In Fundus Fluorescein Angiography

Yang, Chen, Qiao et al. 2021

2021 IEEE International Conference on Multimedia and Expo (ICME)

Automatic segmentation of fluorescein leakage in fundus fluorescein angiography images is important in the clinical diagnosis of advanced diabetic retinopathy. Despite the recent success of deep-learning-based models in improving medical image segmentation, segmentation of fluorescein leakage has been ignored owing to (1) a lack of publicly available data with sufficient annotations for training a segmentation network and (2) incapability of supervised models to accurately localize fluorescein leakage at different imaging angles. To address these issues, we studied the automatic segmentation of fluorescein leakage in fundus fluorescein angiography images and devised a method involving (1) a cross-modal learning framework for fluorescein leakage segmentation using both image and text data, (2) a dual attention learning module for identifying important linguistic and visual features, and (3) fluorescein-related-keyword classification for identifying meaningful textual expressions pertaining to the location and type of fluorescein leakage. We demonstrate the effectiveness of the proposed method for an in-house fundus fluorescein angiography image data set.

“…1c). The recent success of reference expression segmentation (RES), which involves the use of natural language expressions to locate objects [13,14], suggests the possibility of using cross-modal data to build a robust and effective framework for fluorescein leakage segmentation.…”

Section: Introduction (mentioning)

confidence: 99%

Let’s Find Fluorescein: Cross-Modal Dual Attention Learning For Fluorescein Leakage Segmentation In Fundus Fluorescein Angiography

Yang, Chen, Qiao et al. 2021

2021 IEEE International Conference on Multimedia and Expo (ICME)

“…Previous approaches [3,15,9,10] have shown that integrating multi-modal features from different levels of CNN can further improve the accuracy of segmentation masks. In our work, we introduce a bi-directionally convolutional GRU (Bi-ConvGRU) to progressively integrate the fused multimodal features in bottom-up and top-down manners, which is corresponding to the two directions of forward and backward paths.…”

Section: Multi-level Feature Fusion and Mask Prediction (mentioning)

confidence: 99%

“…where $\overrightarrow{H}$ and $\overleftarrow{H}$ denote the last hidden states of two directions, $b$ is the bias term, and $V_{out} \in \mathbb{R}^{h \times w \times D_o}$ is the integrated multi-modal features. Finally, the same decoder layers and binary cross-entropy loss as previous works [3,10] are adopted to predict the segmentation mask and optimize the network, respectively.…”

Section: Multi-level Feature Fusion and Mask Prediction (mentioning)

confidence: 99%
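
The statement above mentions a binary cross-entropy loss for mask prediction. As a minimal sketch of what a per-pixel loss of that kind computes, assuming the predicted mask holds probabilities in [0, 1] and the ground truth is a 0/1 mask, the plain-Python function below averages the loss over all pixels. The function name, the clamping constant `eps`, and the toy masks are illustrative assumptions and do not come from the cited papers.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy.

    pred and target are 2-D lists (rows of pixels). pred holds
    probabilities in [0, 1]; target holds 0/1 labels.
    eps is an assumed clamping constant that avoids log(0).
    """
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp the probability
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy 2x2 example (hypothetical values).
pred_mask = [[0.9, 0.2], [0.6, 0.1]]
gt_mask = [[1, 0], [1, 0]]
loss = binary_cross_entropy(pred_mask, gt_mask)
```

A prediction of 0.5 on a positive pixel gives a loss of ln 2, and the loss approaches zero as the predicted mask approaches the ground truth.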

Global Context Enhanced Multi-modal Fusion for Referring Image Segmentation

Yang, Huang et al. 2020

Pattern Recognition and Computer Vision

In this work, we address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression. Most existing methods focus on establishing unidirectional or directional relationships between visual and linguistic features to associate two modalities together, while the multiscale context is ignored or insufficiently modeled. Multi-scale context is crucial to localize and segment those objects that have large scale variations during the multi-modal fusion process. To solve this problem, we propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel and further introduces a cascaded branch to fuse visual and linguistic features. The cascaded branch can progressively integrate multiscale contextual information and facilitate the alignment of two modalities during the multi-modal fusion process. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods. Code is available at https://github.com/jianhua2022/CMF-Refseg.
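
The abstract describes stacking atrous (dilated) convolutional layers in parallel to capture multi-scale context. The toy sketch below illustrates only that one idea in 1-D: the same kernel is applied at several dilation rates and the branch outputs are summed. It is not the CMF module itself. It omits the cascaded branch, learned weights, and linguistic features, and the function names, the shared kernel, and the rates are assumptions made for illustration.

```python
def dilated_conv1d(signal, kernel, rate):
    """1-D atrous convolution with zero padding (output length equals input length)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * rate  # taps are spaced `rate` samples apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def parallel_atrous(signal, kernel, rates):
    """Apply the kernel at each dilation rate and sum the parallel branches."""
    branches = [dilated_conv1d(signal, kernel, r) for r in rates]
    return [sum(vals) for vals in zip(*branches)]


# A unit impulse shows how larger rates widen the receptive field.
signal = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]
fused = parallel_atrous(signal, kernel, rates=(1, 2, 3))
# fused == [1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0]
```

With rate 1 the impulse reaches only its immediate neighbours, while rates 2 and 3 spread it to positions two and three samples away, so the summed output covers the whole 7-sample window.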
