TAO: A Large-Scale Benchmark for Tracking Any Object

Dave, Achal; Khurana, Tarasha; Tokmakov, Pavel; Schmid, Cordelia; Ramanan, Deva

doi:10.1007/978-3-030-58558-7_26

Cited by 117 publications

(108 citation statements)

References 75 publications

Supporting

Mentioning

107

Contrasting


“…There are many datasets focusing on more diverse object categories than person and vehicles. The ImageNet-Vid [12] benchmark provides trajectory annotations for 30 object categories in over 1000 videos and TAO [10] annotates even 833 object categories to study object tracking on long-tailed distribution.…”

Section: Related Work (mentioning)

confidence: 99%

DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

Sun, Cao, Jiang et al. 2021. Preprint


Figure 1: Sample images from a video in DanceTrack. The shown images are 1, 66, 307 and 327 frames in DanceTrack0027 video. The emphasized properties of this dataset are (1) uniform appearance: humans are in highly similar and almost undistinguished appearance. (2) diverse motion: they are in complicated motion pattern and interaction. The numbers below show their identification which experiences frequent relative position switches and occlusion as well. We expect the combination of uniform appearance and complicated motion pattern makes DanceTrack a platform to encourage more comprehensive and intelligent multi-object tracking algorithms.

“…There are a number of public datasets with box-level annotations for different video tasks: ImageNet-VID [142] for video object detection; LaSOT [115], GOT-10k [143], Youtube-BB [144], and TrackingNet [145] for single object tracking; MOT [146], TAO [147], Youtube-VOS [15] and Youtube-VIS [16] for multi-object tracking. However, none of these datasets meet the requirement of our proposed few-shot video object detection task.…”

Section: Dataset Collection (mentioning)

confidence: 99%

“…To save human annotation effort as much as possible, rather than building our dataset from scratch, we exploit existing large-scale video datasets for supervised learning, i.e., LaSOT [115], GOT-10k [143], and TAO [147] to construct our dataset subject to the above three criteria by: Dataset Filtering. Note that the above datasets cannot be directly used since they are only partially annotated for tracking task: although multiple objects of a given class are present in the video, only some or even one of them is annotated while others may be ignored.…”

Section: Dataset Collection (mentioning)

confidence: 99%

Few-Shot Video Object Detection

Fan, Tang, Tai. 2021. Preprint


We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions: 1) a large-scale video dataset FSVOD-500 comprising of 500 classes with class-balanced videos in each category for few-shot learning; 2) a novel Tube Proposal Network (TPN) to generate high-quality video tube proposals to aggregate feature representation for the target video object; 3) a strategically improved Temporal Matching Network (TMN+) to match representative query tube features and supports with better discriminative ability. Our TPN and TMN+ are jointly and end-to-end trained. Extensive experiments demonstrate that our method produces significantly better detection results on two few-shot video object detection datasets compared to image-based methods and other naive video-based extensions. Codes and datasets will be released at https://github.com/fanq15/FewX.


“…For video object identification, we require video object sequences where objects are associated across multiple frames. Hence, to train and evaluate our proposed approach, we used four video instance segmentation datasets: YouTube Video Instance Segmentation (YT-VIS) [51], Unidentified Video Objects (UVO) [47], Occluded Video Instance Segmentation (OVIS) [34], and Tracking Any Object with Video Object Segmentation (TAO-VOS) [8,43]. All these datasets contain a large object vocabulary and various challenging scenarios, including perceptually-aliased occluded objects, as described below:…”

Section: Datasets (mentioning)

confidence: 99%

“…4) TAO-VOS: This dataset is a subset of the Tracking Any Object (TAO) dataset [8] with masks for video object segmentation. TAO is a benchmark federated object tracking dataset comprising videos from 7 datasets captured in diverse environments.…”

Section: Datasets (mentioning)

confidence: 99%

AirObject: A Temporally Evolving Graph Embedding for Object Identification

Keetha, Wang, Qiu et al. 2021. Preprint


Object encoding and identification are vital for robotic tasks such as autonomous exploration, semantic scene understanding, and re-localization. Previous approaches have attempted to either track objects or generate descriptors for object identification. However, such systems are limited to a "fixed" partial object representation from a single viewpoint. In a robot exploration setup, there is a requirement for a temporally "evolving" global object representation built as the robot observes the object from multiple viewpoints. Furthermore, given the vast distribution of unknown novel objects in the real world, the object identification process must be class-agnostic. In this context, we propose a novel temporal 3D object encoding approach, dubbed AirObject, to obtain global keypoint graph-based embeddings of objects. Specifically, the global 3D object embeddings are generated using a temporal convolutional network across structural information of multiple frames obtained from a graph attention-based encoding method. We demonstrate that AirObject achieves the state-of-the-art performance for video object identification and is robust to severe occlusion, perceptual aliasing, viewpoint shift, deformation, and scale transform, outperforming the state-of-the-art single-frame and sequential descriptors. To the best of our knowledge, AirObject is one of the first temporal object encoding methods.
