“…Current mainstream trackers [10,11,12,129,130,131,132,133,134,135] adopt tracking-by-detection (TBD) by first performing per-frame detection and then associating the detected boxes in the temporal dimension. Current works [13,136,137,138,139] leverage trajectories or tubes to capture motion trails of targets. MOT variants include e.g., video object segmentation (VOS) [14,15], video instance segmentation (VIS) [16], multi-object tracking and segmentation (MOTS) [140] and video panoptic segmentation (VPS) [141].…”