2021
DOI: 10.48550/arXiv.2111.10747
Preprint
MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation

Abstract: Referring image segmentation is a typical multi-modal task that aims to generate a binary mask for the referent described by a given language expression. Prior works adopt a bimodal solution, taking images and language as two modalities within an encoder-fusion-decoder pipeline. However, this pipeline is sub-optimal for the target task for two reasons. First, it only fuses the high-level features produced separately by the uni-modal encoders, which hinders sufficient cross-modal learning. Second, the uni-modal encoder…
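For context, the bimodal encoder-fusion-decoder pipeline the abstract criticizes can be sketched roughly as below. This is a minimal illustrative outline, not MaIL's trimodal network or any specific prior method; the class name and layer choices (a convolutional stem, a GRU, a 1×1 fusion convolution) are hypothetical placeholders chosen only to show where the late, high-level fusion happens.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the bimodal encoder-fusion-decoder baseline described
# in the abstract: two uni-modal encoders, late fusion of high-level features,
# then a mask decoder. Not MaIL's trimodal design.
class BimodalBaseline(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, hid_dim=256):
        super().__init__()
        self.visual_encoder = nn.Conv2d(3, vis_dim, kernel_size=16, stride=16)  # stand-in visual stem
        self.text_encoder = nn.GRU(300, txt_dim, batch_first=True)              # stand-in language encoder
        self.fuse = nn.Conv2d(vis_dim + txt_dim, hid_dim, kernel_size=1)        # late, high-level fusion only
        self.decoder = nn.Sequential(
            nn.Conv2d(hid_dim, hid_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hid_dim, 1, 1),                                           # binary referent-mask logits
        )

    def forward(self, image, word_embs):
        v = self.visual_encoder(image)                      # (B, C_v, H', W')
        _, h = self.text_encoder(word_embs)                 # final hidden state, (1, B, C_t)
        t = h[-1][:, :, None, None].expand(-1, -1, *v.shape[-2:])
        return self.decoder(self.fuse(torch.cat([v, t], dim=1)))  # (B, 1, H', W')

# quick shape check
model = BimodalBaseline()
mask_logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 12, 300))
print(mask_logits.shape)  # torch.Size([2, 1, 14, 14])
```

The point of the sketch is that the two encoders never see each other's features until the very end, which is the first limitation the abstract names.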

Cited by 3 publications (9 citation statements)
References 41 publications
“…As shown in TABLE 7, the proposed approach outperforms MaIL and CRIS by around 2% ∼ 4% IoU without using large-scale vision-language datasets in pretraining, which demonstrates the effectiveness of our proposed modules with stronger visual and textual encoders. Especially, the proposed approach VLT achieves higher performance gain on more difficult dataset G-Ref that has a longer average sentence length and more complex and diverse word usages, e.g., VLT is ∼4% IoU better than MaIL [33] and LAVT [34] on test (U) of G-Ref. It demonstrates the proposed model's good ability in dealing with long and complex expressions with large diversities, which is mainly attributed to input-conditional query generation and selection that well cope with the diverse words/expressions, and masked contrastive learning that enhances the model's generalization ability.…”
Section: Comparison With State-of-the-art Methods (mentioning)
confidence: 98%
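The gains quoted above are reported in IoU between predicted and ground-truth referent masks. As a quick reference, mask IoU and the cumulative overall IoU often reported for referring segmentation can be computed as in the sketch below; the function names are illustrative and not taken from any cited codebase.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0  # both masks empty -> treat as perfect match

def overall_iou(preds, gts) -> float:
    """Cumulative (overall) IoU over a dataset: sum of intersections / sum of unions."""
    inter = sum(np.logical_and(p.astype(bool), g.astype(bool)).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p.astype(bool), g.astype(bool)).sum() for p, g in zip(preds, gts))
    return inter / union

# toy example: two 4x4 masks overlapping in 2 of 6 foreground pixels
p = np.zeros((4, 4), dtype=np.uint8); p[:2, :2] = 1
g = np.zeros((4, 4), dtype=np.uint8); g[:2, 1:3] = 1
print(mask_iou(p, g))  # 2 / 6 = 0.333...
```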
“…We use the popular vision transformer backbone Swin-B [83] as visual encoder and BERT [47] as textual encoder to replace the Darknet53 [71] and bi-GRU [72], respectively. Methods pretrained on large-scale vision-language datasets are marked with †, e.g., MaIL [33] adopts ViLT [38] pre-trained on four large-scale vision-language pretraining datasets and CRIS [35] employs CLIP [40] pretrained on 400M image-text pairs. As shown in TABLE 7, the proposed approach outperforms MaIL and CRIS by around 2% ∼ 4% IoU without using large-scale vision-language datasets in pretraining, which demonstrates the effectiveness of our proposed modules with stronger visual and textual encoders.…”
Section: Comparison With State-of-the-art Methods (mentioning)
confidence: 99%
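The quoted setup replaces Darknet53 and bi-GRU with stronger Swin-B and BERT encoders. A rough sketch of how such encoders might be instantiated with common libraries (timm and Hugging Face transformers) is shown below; the specific checkpoints and the use of pooled features are assumptions for illustration, since a real segmentation pipeline would take multi-scale spatial feature maps from the visual backbone rather than a single pooled vector.

```python
import torch
import timm
from transformers import BertModel, BertTokenizer

# Assumed checkpoints; the cited papers may use different variants and weights.
visual_encoder = timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=0)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

image = torch.randn(1, 3, 224, 224)                            # dummy image batch
tokens = tokenizer("the man in the red shirt on the left", return_tensors="pt")

with torch.no_grad():
    vis_feat = visual_encoder(image)                           # pooled Swin-B feature, (1, 1024)
    txt_out = text_encoder(**tokens)
    word_feats = txt_out.last_hidden_state                     # per-token features, (1, L, 768)
    sent_feat = txt_out.pooler_output                          # sentence-level feature, (1, 768)

print(vis_feat.shape, word_feats.shape, sent_feat.shape)
```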
“…Decoder-fusion methods only interact with high-level features of two modalities and miss the importance of low-level visual features such as texture, shape, and color, as well as word features. Some recent works [10,23,41] proposed encoder-fusion methods to solve the referring image segmentation problem. These methods perform cross-modal interactions on visual and linguistic features in the early encoder stage to update visual features.…”
Section: Introduction (mentioning)
confidence: 99%
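The encoder-fusion idea in the last quote, injecting linguistic features into the visual stream during encoding rather than after it, is commonly realized with cross-attention from visual tokens to word tokens. The block below is a generic, hypothetical sketch of such a fusion step, not the specific module of any cited work.

```python
import torch
import torch.nn as nn

class LanguageToVisionFusion(nn.Module):
    """Generic encoder-stage fusion: visual tokens attend to word tokens and are
    updated with the attended language context (illustrative sketch only)."""
    def __init__(self, vis_dim=256, txt_dim=768, n_heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)              # align text width to visual width
        self.cross_attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, word_feats, word_mask=None):
        # vis_tokens: (B, N, vis_dim) flattened spatial features
        # word_feats: (B, L, txt_dim) per-word features; word_mask: True marks padding
        words = self.txt_proj(word_feats)
        attended, _ = self.cross_attn(vis_tokens, words, words, key_padding_mask=word_mask)
        return self.norm(vis_tokens + attended)                  # residual update of the visual stream

fusion = LanguageToVisionFusion()
out = fusion(torch.randn(2, 14 * 14, 256), torch.randn(2, 12, 768))
print(out.shape)  # torch.Size([2, 196, 256])
```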