2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00141
Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers

Abstract: Online Multi-Object Tracking (MOT) from videos is a challenging computer vision task which has been extensively studied for decades. Most of the existing MOT algorithms are based on the Tracking-by-Detection (TBD) paradigm combined with popular machine learning approaches which largely reduce the human effort to tune algorithm parameters. However, the commonly used supervised learning approaches require the labeled data (e.g., bounding boxes), which is expensive for videos. Also, the TBD framework is usually s…

Cited by 37 publications (61 citation statements) · References 49 publications
“…position, size, appearance) over time. Similar extensions are also provided by (Hsieh et al. 2018; He et al. 2018). In an orthogonal direction, Spatially Invariant Attend, Infer, Repeat (SPAIR) (Crawford and Pineau 2019) improved on AIR's ability to handle cluttered scenes by replacing AIR's recurrent encoder network with a convolutional network and a spatially local object specification scheme.…”
Section: Related Work
confidence: 92%
“…We also experimented with Tracking by Animation (TbA) (He et al. 2018), but were unable to obtain good tracking performance on these densely cluttered videos. One relevant point is that TbA lacks a means of encouraging the network to explain scenes using few objects, and we found that TbA often used several internal objects to explain a single object in the video; in contrast, both SILOT and SQAIR use priors on o_pres which encourage o_pres to be near 0, forcing the networks to use internal objects efficiently.…”
Section: Scattered MNIST
confidence: 99%
“…He et al. proposed an end-to-end tracking framework [116] that learns from unlabeled data; this framework combines Reprioritized Attentive Tracking with Tracking-By-Animation. Lee and Kim proposed a Feature Pyramid Siamese Network (FPSN) [117] to extract multi-level feature information and to add spatio-temporal motion features to consider both appearance and motion information.…”
Section: Motion Variations
confidence: 99%
“…They tracked objects by fusing trajectory dynamics information, and proposed a novel two-step data association framework. He et al. [26] proposed a tracking-by-animation framework to achieve both label-free and end-to-end learning for MOT, unlike tracking-by-detection frameworks, which isolate the detection task from the tracking task. Their differentiable neural network first tracks objects in input frames, and then animates the tracked objects in reconstructed frames.…”
Section: Related Work
confidence: 99%