Polar Relative Positional Encoding for Video-Language Segmentation

Ning, Ke; Xie, Lingxi; Wu, Fei

doi:10.24963/ijcai.2020/132

Cited by 34 publications (23 citation statements)

References 12 publications

“…To solve these challenges and effectively align video with text, existing RVOS approaches [13,24,32] typically rely on complicated pipelines. In contrast, here we propose a simple, end-to-end Transformer-based approach to RVOS.…”

Section: Multimodal Transformer (mentioning)

confidence: 99%

“…The RVOS task was originally introduced by Gavrilyuk et al [11], whose goal was to attain pixel-level segmentation of actors and their actions in video content. To effectively aggregate and align visual, temporal and lingual information from video and text, state-of-the-art approaches to RVOS typically rely on complicated pipelines [24,30,32,40,41]. Gavrilyuk et al [11] proposed an I3D-based [4] encoder-decoder architecture that generated dynamic filters from text features and convolved them with visual features to obtain the segmentation masks.…”

Section: Related Work (mentioning)

confidence: 99%
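
The dynamic-filter idea mentioned in the statement above can be illustrated with a minimal sketch. It assumes NumPy, a single 1x1 filter, and random toy features; the names `dynamic_filter_response` and `proj` are hypothetical, and this is not the I3D-based encoder-decoder of Gavrilyuk et al.

```python
import numpy as np

def dynamic_filter_response(visual, text, proj):
    """Generate a 1x1 filter from a text vector and apply it to a feature map.

    visual: (C, H, W) visual features; text: (D,) sentence feature;
    proj: (C, D) hypothetical projection that maps text to filter weights.
    """
    kernel = np.tanh(proj @ text)                       # (C,) text-conditioned filter
    response = np.einsum("c,chw->hw", kernel, visual)   # 1x1 convolution over channels
    return 1.0 / (1.0 + np.exp(-response))              # per-pixel score in [0, 1]

# Toy inputs (assumed sizes, random values for illustration only).
rng = np.random.default_rng(0)
visual = rng.standard_normal((8, 4, 5))
text = rng.standard_normal(16)
proj = rng.standard_normal((8, 16))
mask = dynamic_filter_response(visual, text, proj)
print(mask.shape)
```

The point of the sketch is only that the filter weights depend on the sentence, so the same visual features yield different response maps for different referring expressions.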

“…For a more effective representation than convolutions, VT-Capsule [30] encoded each modality in capsules [35], while ACGA [41] utilized a co-attention mechanism to enhance the multimodal features. To improve positional relation representations in the text, PRPE [32] explored a positional encoding mechanism based on the polar coordinate system. URVOS [37] improved tracking capabilities by performing language-based object segmentation using the key frame in the video and propagating the predicted mask throughout the video.…”

Section: Related Work (mentioning)

confidence: 99%
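
As a rough illustration of what a polar-coordinate relative position can look like, the sketch below converts each grid cell's offset from a reference cell into a radius and an angle. It assumes NumPy and a hypothetical helper `polar_relative_positions`; it is not the formulation used in PRPE, which this page does not give.

```python
import numpy as np

def polar_relative_positions(height, width, ref_row, ref_col):
    """Return an (H, W, 2) array of (radius, angle) relative to a reference cell."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    dy = rows - ref_row              # vertical offset (image rows grow downward)
    dx = cols - ref_col              # horizontal offset
    radius = np.hypot(dx, dy)        # Euclidean distance to the reference cell
    angle = np.arctan2(dy, dx)       # direction in radians, in [-pi, pi]
    return np.stack([radius, angle], axis=-1)

enc = polar_relative_positions(3, 4, ref_row=1, ref_col=1)
print(enc[1, 3])  # cell two columns to the right of the reference
```

Compared with Cartesian offsets, the angle channel separates direction ("left of", "above") from distance, which is the kind of relation a referring sentence tends to describe.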

“…A suitable temporal encoder for the RVOS task should be able to extract both visual characteristics (e.g., shape, size, location) and action semantics for each instance in the video. Several previous works [11,24,32] utilized the Kinetics-400 [16] pre-trained I3D network [4] as their temporal encoder. However, since I3D was originally designed for action classification, using its outputs as-is for tasks that require fine details (e.g., instance segmentation) is not ideal as the features it outputs tend to suffer from spatial misalignment caused by temporal downsampling.…”

Section: Temporal Encoder (mentioning)

confidence: 99%

End-to-End Referring Video Object Segmentation with Multimodal Transformers

Zheltonozhskii, Baskin. 2021. Preprint.

The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem. Following recent advancements in computer vision and natural language processing, MTTR is based on the realization that video and text can both be processed together effectively and elegantly by a single multimodal Transformer model. MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps. As such, it simplifies the RVOS pipeline considerably compared to existing methods. Evaluation on standard benchmarks reveals that MTTR significantly outperforms previous art across multiple metrics. In particular, MTTR shows impressive +5.7 and +5.0 mAP gains on the A2D-Sentences and JHMDB-Sentences datasets respectively, while processing 76 frames per second. In addition, we report strong results on the public validation set of Refer-YouTube-VOS, a more challenging RVOS dataset that has yet to receive the attention of researchers. The code to reproduce our experiments is available at https://github.com/mttr2021/MTTR.

“…Current studies in the field of RVOS are made mainly around the theme of building effective multimodal feature representations. Existing methods typically make use of dynamic convolutions [21,4] to adaptively generate convolutional filters that better respond to the referent, or leverage cross-modal attention [22,16] to compute the correlations among input visual and linguistic embeddings. However, these methods only approach RVOS on the grid level, ignoring the importance of object-level visual cues.…”

Section: Related Work (mentioning)

confidence: 99%
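
The cross-modal attention mentioned in the statement above can be sketched as scaled dot-product attention from visual positions to word embeddings. This is a generic illustration assuming NumPy and toy random features; `cross_modal_attention` is a hypothetical name, not the mechanism of any specific cited method.

```python
import numpy as np

def cross_modal_attention(visual, words):
    """Attend from visual positions to words.

    visual: (N, d) flattened grid features; words: (T, d) word embeddings.
    Returns the attended language features (N, d) and the weights (N, T).
    """
    scores = visual @ words.T / np.sqrt(visual.shape[-1])   # (N, T) correlations
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True) # softmax over words
    return weights @ words, weights

# Toy inputs (assumed sizes, random values for illustration only).
rng = np.random.default_rng(0)
visual_tokens = rng.standard_normal((6, 8))   # 6 grid positions
word_tokens = rng.standard_normal((3, 8))     # 3 words
attended, weights = cross_modal_attention(visual_tokens, word_tokens)
print(attended.shape, weights.shape)
```

Each row of `weights` is a distribution over words for one grid position, which is what makes this a grid-level interaction in the sense the statement criticizes.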

Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

Liang, Wu, Zhou et al. 2021. Preprint.

Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference. Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice. Such a bottom-up strategy fails to explore object-level cues, easily leading to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranks 1st in the CVPR2021 Referring Youtube-VOS challenge.