2022
DOI: 10.48550/arxiv.2211.09108
Preprint

Robust Online Video Instance Segmentation with Track Queries

Abstract: Recently, transformer-based methods have achieved impressive results on Video Instance Segmentation (VIS). However, most of these top-performing methods run in an offline manner by processing the entire video clip at once to predict instance mask volumes. This makes them incapable of handling the long videos that appear in challenging new video instance segmentation datasets like UVO and OVIS. We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline …

Cited by 1 publication (4 citation statements)
References 53 publications
“…However, tracking in complex videos presents significant challenges, such as short or long-term occlusion, instance appearance or disappearance, and ID switching. We observe that the propagation strategy of [20,11,29] is helpful when consecutive frames share a considerable similarity. However, in abrupt changes, propagation might accumulate errors like lost tracks, ID switches, or inconsistent masks.…”
Section: Introduction (mentioning)
confidence: 91%
“…The recent emergence of datasets [16,24] containing lengthy and occluded videos has presented more challenging, real-world scenarios for VIS, driving the development of advanced deployable online models, particularly Detection-Transformer (DETR) [2,30] variants. These models primarily rely on frame-level detection and inter-frame association facilitated through tracking [22,8] or query propagation [20,11,6,29].…”
Section: Introduction (mentioning)
confidence: 99%