2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
DOI: 10.1109/cvpr52688.2022.01761

ReSTR: Convolution-free Referring Image Segmentation Using Transformers

Abstract: Referring image segmentation, the task of segmenting arbitrary entities described in free-form text, opens up a variety of vision applications. However, manual labeling of training data for this task is prohibitively costly, leading to a lack of labeled data for training. We address this issue with a weakly supervised learning approach that uses text descriptions of training images as the only source of supervision. To this end, we first present a new model that discovers semantic entities in the input image and then …

Cited by 97 publications (18 citation statements)
References 83 publications
“…Our concurrent work MDETR [31] employs DETR [15] to build an end-to-end modulated detector and reason jointly over language and image. After the proposed VLT [32], transformer-based referring segmentation architectures receive more attention [33], [34], [35], [36], [37]. MaIL [33] follows the transformer architecture ViLT [38] and utilizes instance masks predicted by Mask R-CNN [39] as additional input.…”
Section: Referring Segmentation
confidence: 99%
“…Following the unprecedented success in natural language tasks, transformers [48] have also made great achievements in image recognition tasks recently. The ViT model has become very popular in various computer vision tasks including image classification [16], image detection [49], image segmentation [50], and so on. In the field of natural image recognition, ViT and its derived instances have achieved state-of-the-art performance on several benchmark datasets.…”
Section: Visual Transformer in Medicine
confidence: 99%
“…Based on this pipeline, some works [2,15,25,42] improved the performance by utilizing more powerful feature encoders and designing more ingenious fusion strategies (e.g., recurrent fusion). With the success of attention mechanisms in the communities of natural language processing and computer vision, follow-up works [8,21,29,43] attempted to model intra-modal and cross-modal relationships by adopting self-attention and cross-attention operations. For example, Ye et al. [43] proposed Cross-Modal Self-Attention (CMSA) to highlight informative visual and linguistic elements.…”
Section: Related Work
confidence: 99%
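The cross-attention operation referenced in the statement above (one modality's features serving as queries against the other modality's keys and values) can be sketched minimally as follows. This is an illustrative sketch of generic cross-modal attention, not the CMSA module of [43]; the function names, toy dimensions, and the choice of visual features as queries are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, linguistic):
    """Let each visual position (query) attend over the words
    (keys/values) of the referring expression.

    visual:     (HW, D) flattened spatial features
    linguistic: (L, D)  word features
    returns:    (HW, D) language-aware visual features
    """
    d_k = visual.shape[-1]
    scores = visual @ linguistic.T / np.sqrt(d_k)   # (HW, L) similarity
    weights = softmax(scores, axis=-1)              # rows sum to 1
    return weights @ linguistic                     # (HW, D)

# toy example: 4 spatial positions, 3 words, 8-dim features
rng = np.random.default_rng(0)
vis = rng.normal(size=(4, 8))
txt = rng.normal(size=(3, 8))
out = cross_modal_attention(vis, txt)
print(out.shape)  # (4, 8)
```

In practice such modules add learned query/key/value projections and multiple heads; the sketch keeps only the core weighted-aggregation step that lets linguistic elements modulate each spatial location.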