Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence 2020
DOI: 10.24963/ijcai.2020/132
Polar Relative Positional Encoding for Video-Language Segmentation

Abstract: In this paper, we tackle a challenging task named video-language segmentation. Given a video and a sentence in natural language, the goal is to segment the object or actor described by the sentence in video frames. To accurately denote a target object, the given sentence usually refers to multiple attributes, such as nearby objects with spatial relations, etc. In this paper, we propose a novel Polar Relative Positional Encoding (PRPE) mechanism that represents spatial relations in a "linguistic" way,…
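The abstract describes encoding spatial relations between positions in polar form (distance and angle) rather than as Cartesian offsets, which maps more directly onto linguistic expressions like "left of" or "above". A minimal sketch of that idea follows; the function name and grid setup are illustrative assumptions, not the authors' implementation:

```python
import math

def polar_relative_positions(h, w):
    """For each ordered pair of cells (i, j) on an h x w grid, compute the
    position of j relative to i in polar form: (distance, angle).
    Angles are in radians, measured from the positive x-axis.
    This is only a sketch of the polar-encoding idea, not the PRPE module."""
    coords = [(y, x) for y in range(h) for x in range(w)]
    rel = {}
    for i, (yi, xi) in enumerate(coords):
        for j, (yj, xj) in enumerate(coords):
            dy, dx = yj - yi, xj - xi
            rho = math.hypot(dx, dy)      # radial distance
            theta = math.atan2(dy, dx)    # polar angle
            rel[(i, j)] = (rho, theta)
    return rel

rel = polar_relative_positions(2, 2)
# The cell immediately to the right of cell 0 lies at distance 1, angle 0.
print(rel[(0, 1)])
```

In a full model, each (distance, angle) pair would be embedded and mixed into the attention scores; here it is left as raw coordinates to keep the sketch self-contained.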

Cited by 34 publications (23 citation statements)
References 12 publications
“…To solve these challenges and effectively align video with text, existing RVOS approaches [13,24,32] typically rely on complicated pipelines. In contrast, here we propose a simple, end-to-end Transformer-based approach to RVOS.…”
Section: Multimodal Transformer
confidence: 99%
“…The RVOS task was originally introduced by Gavrilyuk et al [11], whose goal was to attain pixel-level segmentation of actors and their actions in video content. To effectively aggregate and align visual, temporal and lingual information from video and text, state-of-the-art approaches to RVOS typically rely on complicated pipelines [24,30,32,40,41]. Gavrilyuk et al [11] proposed an I3D-based [4] encoder-decoder architecture that generated dynamic filters from text features and convolved them with visual features to obtain the segmentation masks.…”
Section: Related Work
confidence: 99%
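The statement above describes the dynamic-filter approach of Gavrilyuk et al.: convolution weights are generated from text features and then convolved with the visual feature map to score each pixel for the referred object. A minimal pure-Python sketch of a text-conditioned 1x1 dynamic convolution (all names and shapes are illustrative assumptions, not the original I3D-based pipeline):

```python
def dynamic_1x1_conv(visual, filt):
    """Apply a text-generated 1x1 filter to a visual feature map.

    visual: nested lists of shape [C][H][W] (a CNN feature map)
    filt:   list of C weights, produced elsewhere from a sentence embedding
    Returns an H x W response map: at each pixel, the dot product of the
    channel vector with the filter, i.e. a 1x1 convolution.
    """
    C, H, W = len(visual), len(visual[0]), len(visual[0][0])
    return [[sum(filt[c] * visual[c][y][x] for c in range(C))
             for x in range(W)]
            for y in range(H)]

# Toy example: a 2-channel, 2x2 feature map and a hypothetical filter
# that a language encoder might emit for the given sentence.
visual = [[[1, 2],
           [3, 4]],
          [[0, 1],
           [1, 0]]]
filt = [1, 10]
print(dynamic_1x1_conv(visual, filt))
```

Real implementations generate larger kernels and apply them with a framework convolution op; the 1x1 case shown here reduces to a per-pixel dot product, which keeps the sketch dependency-free.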
“…Current studies in the field of RVOS revolve mainly around building effective multimodal feature representations. Existing methods typically make use of dynamic convolutions [21,4] to adaptively generate convolutional filters that better respond to the referent, or leverage cross-modal attention [22,16] to compute the correlations among input visual and linguistic embeddings. However, these methods only approach RVOS on the grid level, ignoring the importance of object-level visual cues.…”
Section: Related Work
confidence: 99%