2022
DOI: 10.1007/s11042-022-13413-x
A closer look at referring expressions for video object segmentation

Abstract: The task of Language-guided Video Object Segmentation (LVOS) aims at generating binary masks for an object referred by a linguistic expression. When this expression unambiguously describes an object in the scene, it is named referring expression (RE). Our work argues that existing benchmarks used for LVOS are mainly composed of trivial cases, in which referents can be identified with simple phrases. Our analysis relies on a new categorization of the referring expressions in the DAVIS-2017 and Actor-Action data…

Cited by 17 publications (2 citation statements)
References 43 publications
“…The first stage extracts language and image features, respectively, and the second stage fuses the multi-modal features to predict the segmentation result. In the first stage, these methods usually use ResNet [5,34], DeeplabV3 [28,35,36], or Darknet [6] to extract image features, and LSTM [9,10,36], simple recurrent units [37], or Transformer [35,38] to extract language features. In the second stage, most methods construct a cross-modal decoder to fuse the image and language features.…”
Section: Referring Image Segmentation
confidence: 99%
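The two-stage pipeline described in this statement can be sketched in a few lines. This is a minimal, untrained NumPy illustration, not any cited method's implementation: random arrays stand in for the backbone features (in the cited works these come from a CNN such as ResNet and a language encoder such as an LSTM or Transformer), and fusion is reduced to broadcast-and-concatenate followed by a 1x1 projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (stand-ins): fake backbone outputs with plausible shapes.
# In practice img_feat would come from a CNN and lang_feat from a
# language encoder; the shapes below are illustrative assumptions.
H, W, C_img, C_lang = 8, 8, 16, 12
img_feat = rng.standard_normal((H, W, C_img))   # spatial feature map
lang_feat = rng.standard_normal((C_lang,))      # sentence embedding

# Stage 2: minimal cross-modal fusion -- broadcast the sentence vector
# over every spatial location, concatenate with the image features,
# then project to one channel (a 1x1 "conv" as a matmul) and apply a
# sigmoid to get a per-pixel foreground probability.
lang_tiled = np.broadcast_to(lang_feat, (H, W, C_lang))
fused = np.concatenate([img_feat, lang_tiled], axis=-1)  # (H, W, C_img+C_lang)

w = rng.standard_normal((C_img + C_lang, 1)) * 0.1       # untrained weights
logits = fused @ w                                       # (H, W, 1)
mask_prob = 1.0 / (1.0 + np.exp(-logits[..., 0]))        # (H, W)

binary_mask = mask_prob > 0.5                            # final binary mask
```

Real cross-modal decoders replace the concatenate-and-project step with learned attention or dynamic convolution, but the data flow (tile language features over space, fuse, predict a mask) is the same.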
“…Image information is processed by a contextual self-calibration structure to obtain deeper features, including a large number of spatial and semantic features [28,29]. However, this approach extracts only contextual features, yielding an incomplete representation; significant feature information is wasted, ultimately compromising the network's output.…”
confidence: 99%