2024
DOI: 10.1609/aaai.v38i6.28465

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Shilin Yan, Renrui Zhang, Ziyu Guo, et al.

Abstract: Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. It is challenging to explore the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for…

Cited by 6 publications
References 30 publications