2021
DOI: 10.48550/arxiv.2111.14821
Preprint

End-to-End Referring Video Object Segmentation with Multimodal Transformers

Abstract: The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS t…

Cited by 2 publications (11 citation statements)
References 30 publications
“…ClawCraneNet [16] leverages cross-modal attention to bridge the semantic correlation between textual and visual modalities. ReferFormer [41] and MTTR [1] are two latest works that utilize transformers to decode or fuse multimodal features. ReferFormer [41] employs a linguistic prior to the transformer decoder to focus on the referred object.…”
Section: Related Workmentioning
confidence: 99%
“…ReferFormer [41] employs a linguistic prior to the transformer decoder to focus on the referred object. MTTR [1] leverages a multimodal transformer encoder to fuse linguistic and visual features. Different from other vision-language tasks, e.g., image-text retrieval [20,22,29] and video question answering [12,37], R-VOS needs to construct object-level multimodal semantic consensus in a dense visual representation.…”
Section: Related Workmentioning
confidence: 99%
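The citation statements above describe MTTR's key design choice: fusing linguistic and visual features in a single multimodal Transformer encoder rather than in a task-specific pipeline. A minimal sketch of that fusion pattern is below; the class name, dimensions, and token layout are illustrative assumptions, not MTTR's actual implementation.

```python
import torch
import torch.nn as nn

class MultimodalFusionEncoder(nn.Module):
    """Toy sketch of MTTR-style early fusion: flattened per-frame visual
    tokens and text token embeddings are concatenated into one sequence
    and processed jointly by a standard Transformer encoder.
    All names and dimensions here are illustrative assumptions."""

    def __init__(self, d_model: int = 256, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # visual_tokens: (B, T*H*W, d)  -- flattened video features
        # text_tokens:   (B, L, d)      -- projected text embeddings
        seq = torch.cat([visual_tokens, text_tokens], dim=1)
        fused = self.encoder(seq)  # joint self-attention across both modalities
        n_vis = visual_tokens.shape[1]
        # split the fused sequence back into its visual and textual parts
        return fused[:, :n_vis], fused[:, n_vis:]

# Example with toy shapes: 2 clips, 8 frames x 10 spatial tokens, 12 text tokens.
B, d = 2, 256
vis = torch.randn(B, 8 * 10, d)
txt = torch.randn(B, 12, d)
fused_vis, fused_txt = MultimodalFusionEncoder()(vis, txt)
print(fused_vis.shape, fused_txt.shape)
# torch.Size([2, 80, 256]) torch.Size([2, 12, 256])
```

Because self-attention runs over the concatenated sequence, every visual token can attend to every text token (and vice versa) in one pass, which is what lets a single encoder replace separate cross-modal attention modules.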