2023
DOI: 10.1109/tmm.2022.3163578

Instance-Specific Feature Propagation for Referring Segmentation

Abstract: We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression. Many recent works utilize Transformer to extract features for the target object by aggregating the attended visual regions. However, the generic attention mechanism in Transformer only uses the language input for attention weight calculation, which does not explicitly fuse language features in its output. Thus, its output feature is dominated by vision information, which…
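To make the limitation described in the abstract concrete, the sketch below is an illustrative NumPy example (not code from the paper; the function name, shapes, and variables are assumptions) of generic cross-attention: the language queries only determine the attention weights, while the output is a weighted sum of visual value vectors alone, so no language feature enters the output directly.

```python
import numpy as np

def generic_cross_attention(lang_q, vis_k, vis_v):
    """Generic cross-attention: language features produce only the
    attention weights; the output aggregates visual values."""
    d = lang_q.shape[-1]
    # Attention weights: language queries scored against visual keys.
    scores = lang_q @ vis_k.T / np.sqrt(d)            # (L, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over visual regions
    # Output: a convex combination of visual features only; language
    # information influences the mixing weights but is never added to
    # the output itself, which is the issue the abstract points out.
    return weights @ vis_v                            # (L, d)

# Toy shapes: 5 language tokens, 100 visual regions, 256-dim features.
lang_q = np.random.randn(5, 256)
vis_k = np.random.randn(100, 256)
vis_v = np.random.randn(100, 256)
print(generic_cross_attention(lang_q, vis_k, vis_v).shape)  # (5, 256)
```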

Cited by 32 publications (12 citation statements)
References 56 publications

Citation statements (ordered by relevance):
“…It outperforms LAVT in most datasets slightly, but our method still outperforms it in UNC, UNC+ and G-Ref(Google) datasets. We compare our method with 11 state-of-the-art referring image segmentation methods in the validation set of UNC from 2019-2022, including CMSA [13], STEP [59], CMPC [11], LSCM [39], MCN [6], LELA-CLAF [5], CEFNet [42], VLT [55], ISF-FPN [56], CRIS [57] and LAVT [19]. Table 2 shows the qualitative comparison results using Prec@X and mIoU in UNC dataset.…”
Section: Quantitative Results
Confidence: 99%
“…• Referring VOS. Referring video object segmentation [44,45,46,47,48,49,50] is an emerging setting that involves multi-modal information. It gives a natural language expression to indicate the target object and aims at segmenting the target object throughout the video clips.…”
Section: Video Object Segmentation (VOS)
Confidence: 99%
“…Model Pretraining. Different from model pretraining of 2D vision tasks [17,21-23,40-42,46,59], recent researches on deep model pretraining for understanding 3D objects mainly focus on unsupervised learning methods. Typical unsupervised tasks include self-reconstruction [2,24,28,30,76] and self-supervised learning pretext tasks [25,54,57,66,74].…”
Section: Deep Learning for 3D Object Understanding
Confidence: 99%