“…Thus, many works have focused on data association, aiming to exploit similarity cues such as visual appearance [4,18,34,44,52,62,68,72], 2D object motion [5,6,17,28,71] or 3D object motion [27,42,46,49,50,64] most effectively. Recently, researchers have focused on learning data association with graph neural networks [7,63] or transformers [43,65,76,81]. However, those works dismiss a more profound problem in the tracking-by-detection pipeline that precedes data association: Contemporary object detectors [25,38,57,58,59] are designed for closed-set scenarios where all objects appear frequently in the training and testing data distributions.…”