2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01762

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Abstract: Referring image segmentation segments out the image region referred to by a natural language expression. With the aim of producing high-quality masks, existing methods often adopt iterative learning approaches that rely on RNNs or stacked attention layers to refine vision-language features. Despite their complexity, RNN-based methods are subject to specific encoder choices, while attention-based methods offer limited gains. In this work, we introduce a simple yet effective alternative for progressively learning discriminative multi-modal f…
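
A minimal, hypothetical sketch of the task interface described in the abstract: a referring segmentation model consumes an image and a tokenized expression and emits a pixel-wise mask of the referred object. The class and submodule names (ReferringSegmenter, vision_backbone, text_encoder, mask_head) are illustrative assumptions, not the LAVT implementation.

```python
import torch
import torch.nn as nn

class ReferringSegmenter(nn.Module):
    """Skeleton of a referring image segmentation model (illustrative only)."""

    def __init__(self, vision_backbone: nn.Module, text_encoder: nn.Module, mask_head: nn.Module):
        super().__init__()
        self.vision_backbone = vision_backbone  # e.g. a hierarchical vision Transformer
        self.text_encoder = text_encoder        # e.g. a pretrained language model
        self.mask_head = mask_head              # maps fused features to per-pixel logits

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); token_ids: (B, L) word indices of the referring expression
        text_feats = self.text_encoder(token_ids)        # (B, L, C_t) word features
        fused = self.vision_backbone(image, text_feats)  # language-conditioned visual features
        return self.mask_head(fused)                     # (B, 1, H, W) mask logits
```

Training such a model would typically minimize a per-pixel binary cross-entropy (or similar) loss between the predicted logits and the ground-truth mask of the referred object.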

Cited by 162 publications (98 citation statements)
References 52 publications

“…As shown in TABLE 7, the proposed approach outperforms MaIL and CRIS by around 2% ∼ 4% IoU without using large-scale vision-language datasets in pretraining, which demonstrates the effectiveness of our proposed modules with stronger visual and textual encoders. Especially, the proposed approach VLT achieves a higher performance gain on the more difficult dataset G-Ref, which has a longer average sentence length and more complex and diverse word usage, e.g., VLT is ∼4% IoU better than MaIL [33] and LAVT [34] on test (U) of G-Ref. It demonstrates the proposed model's good ability in dealing with long and complex expressions with large diversities, which is mainly attributed to input-conditional query generation and selection that cope well with the diverse words/expressions, and masked contrastive learning that enhances the model's generalization ability.…”
Section: Comparison With State-of-the-art Methods
Citation type: mentioning, confidence: 99%
“…Our concurrent work MDETR [31] employs DETR [15] to build an end-to-end modulated detector and reason jointly over language and image. After the proposed VLT [32], transformer-based referring segmentation architectures have received more attention [33], [34], [35], [36], [37]. MaIL [33] follows the transformer architecture ViLT [38] and utilizes the instance masks predicted by Mask R-CNN [39] as additional input.…”
Section: Referring Segmentation
Citation type: mentioning, confidence: 99%
“…Recent studies [10,19,41] have shown that cross-modal interactions during feature extraction can further enhance multi-modal alignment. Feng et al [10] replaced the vision encoder with a multi-modal encoder by adopting an early cross-modal interaction strategy, which achieved deep interweaving between visual and linguistic features.…”
Section: Related Work
Citation type: mentioning, confidence: 99%
“…Decoder-fusion methods only interact with high-level features of two modalities and miss the importance of low-level visual features such as texture, shape, and color, as well as word features. Some recent works [10,23,41] proposed encoder-fusion methods to solve the referring image segmentation problem. These methods perform cross-modal interactions on visual and linguistic features in the early encoder stage to update visual features.…”
Section: Introduction
Citation type: mentioning, confidence: 99%
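
The encoder-fusion idea quoted above, where linguistic features update visual features during early encoding rather than only at the decoder, can be illustrated with a small cross-attention module. This is a generic sketch under simplified assumptions, not the exact fusion design of LAVT, VLT, or any other cited method; all names (CrossModalFusion, txt_proj, gate) are hypothetical.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Updates flattened visual tokens with word features via cross-attention (illustrative only)."""

    def __init__(self, vis_dim: int, txt_dim: int, num_heads: int = 8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)  # map word features to the visual width
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Tanh())
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, H*W, C_v) visual tokens from one encoder stage; txt: (B, L, C_t) word features
        txt = self.txt_proj(txt)                                # (B, L, C_v)
        attended, _ = self.attn(query=vis, key=txt, value=txt)  # each visual token attends to words
        # Gated residual update: language modulates, rather than overwrites, the visual features.
        return self.norm(vis + self.gate(attended) * attended)


# Usage: fuse after an encoder stage, then pass the result on to the next stage.
vis = torch.randn(2, 64 * 64, 256)  # dummy visual tokens
txt = torch.randn(2, 12, 768)       # dummy word features
fused = CrossModalFusion(vis_dim=256, txt_dim=768)(vis, txt)
print(fused.shape)                  # torch.Size([2, 4096, 256])
```

The gated residual keeps the updated features close to the original visual stream, one common way to keep such early fusion stable; a module of this kind can be repeated at several encoder stages so that language information shapes increasingly high-level visual features.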