YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video

Real, Esteban; Shlens, Jonathon; Mazzocchi, Stefano; Pan, Xin; Vanhoucke, Vincent

doi:10.1109/cvpr.2017.789

Cited by 556 publications

(295 citation statements)

References 45 publications

Citation types: Supporting, Mentioning (295), Contrasting

“…We train a version of our tracker with the ResNet-50 backbone using only the ImageNet VID [31], TrackingNet [25] and COCO [22] datasets. We compare this version, denoted DiMP-50-data with the state-of-the-art Siamese tracker, SiamRPN++ [20], trained using ImageNet VID, YouTube-BB [29], COCO and ImageNet DET [Figure S3: Success plots on the NFS (a), OTB-100 (b), and UAV123 (c) datasets.]…”

Section: S6 Impact Of Training Data (mentioning)

confidence: 99%

Learning Discriminative Model Prediction for Tracking

Bhat, Danelljan, Van Gool, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

The current drive towards end-to-end trainable computer vision systems imposes major challenges for the task of visual tracking. In contrast to most other vision problems, tracking requires the learning of a robust target-specific appearance model online, during the inference stage. To be end-to-end trainable, the online learning of the target model thus needs to be embedded in the tracking architecture itself. Due to these difficulties, the popular Siamese paradigm simply predicts a target feature template. However, such a model possesses limited discriminative power due to its inability to integrate background information. We develop an end-to-end tracking architecture, capable of fully exploiting both target and background appearance information for target model prediction. Our architecture is derived from a discriminative learning loss by designing a dedicated optimization process that is capable of predicting a powerful model in only a few iterations. Furthermore, our approach is able to learn key aspects of the discriminative loss itself. The proposed tracker sets a new state-of-the-art on 6 tracking benchmarks, achieving an EAO score of 0.440 on VOT2018, while running at over 40 FPS.

“…The OxUvA [33] long-term dataset consists of 366 object tracks in 337 videos, which are carefully selected from the YTBB [27] dataset and sparsely labeled at a frequency of 1 Hz. Compared with the popular short-term tracking dataset (such as OTB2015), this dataset has many long-term videos (each video lasts for 2.4 minutes on average) and includes severe out-of-view and full occlusion challenges.…”

Section: Results On OxUvA (mentioning)

confidence: 99%

‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-Term Tracking

Yan, Zhao, Wang, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Compared with traditional short-term tracking, long-term tracking poses more challenges and is much closer to realistic applications. However, few works have addressed it, and their performance has been limited. In this work, we present a novel robust and real-time long-term tracking framework based on the proposed skimming and perusal modules. The perusal module consists of an effective bounding box regressor to generate a series of candidate proposals and a robust target verifier to infer the optimal candidate with its confidence score. Based on this score, our tracker determines whether the tracked object is present or absent, and then chooses the tracking strategy of local search or global search, respectively, in the next frame. To speed up the image-wide global search, a novel skimming module is designed to efficiently choose the most probable regions from a large number of sliding windows. Numerous experimental results on the VOT-2018 long-term and OxUvA long-term benchmarks demonstrate that the proposed method achieves the best performance and runs in real-time. The source codes are available at https://github.com/iiau-tracker/SPLT.
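The switching rule described in this abstract (verifier confidence decides whether the next frame uses local or global search) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed threshold of 0.5, and the list of precomputed confidence scores are all assumptions made for illustration.

```python
from typing import List


def choose_search_mode(confidence: float, threshold: float = 0.5) -> str:
    """Pick the search strategy for the NEXT frame.

    The threshold value is hypothetical; the abstract does not state one.
    """
    # High confidence: target judged present, so search near the last box.
    # Low confidence: target judged absent, so search the whole image.
    return "local" if confidence >= threshold else "global"


def plan_search_modes(confidences: List[float], threshold: float = 0.5) -> List[str]:
    """Return the search mode used in each frame.

    confidences[i] is the (hypothetical) verifier score obtained in frame i.
    The first frame is searched locally because the target box is given.
    """
    modes = []
    mode = "local"
    for score in confidences:
        modes.append(mode)  # mode used while processing this frame
        mode = choose_search_mode(score, threshold)  # decided for the next frame
    return modes


if __name__ == "__main__":
    print(plan_search_modes([0.9, 0.2, 0.8]))
```

In this toy run the low score in the second frame triggers a global search in the third frame, which mirrors the present/absent decision the abstract describes.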

“…Finally, we utilize more unlabeled videos for network training. These additional raw videos are from the OxUvA benchmark [48] (337 videos in total), which is a subset of Youtube-BB [41]. In Fig.…”

Section: Ablation Study and Analysis (mentioning)

confidence: 99%

Unsupervised Deep Tracking

Wang, Song, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

We propose an unsupervised visual tracking method in this paper. Different from existing approaches using extensive annotated data for supervised learning, our CNN model is trained on large-scale unlabeled videos in an unsupervised manner. Our motivation is that a robust tracker should be effective in both the forward and backward predictions (i.e., the tracker can forward localize the target object in successive frames and backtrace to its initial position in the first frame). We build our framework on a Siamese correlation filter network, which is trained using unlabeled raw videos. Meanwhile, we propose a multiple-frame validation method and a cost-sensitive loss to facilitate unsupervised learning. Without bells and whistles, the proposed unsupervised tracker achieves the baseline accuracy of fully supervised trackers, which require complete and accurate labels during training. Furthermore, the unsupervised framework exhibits potential in leveraging unlabeled or weakly labeled data to further improve the tracking accuracy.

* Y. Song and W. Liu are the corresponding authors. This work was done when N. Wang was an intern at Tencent AI Lab. The source code and results are available at https://github.com/594422814/UDT.

[Figure: supervised training (annotated sequences, forward tracking) versus unsupervised training (unlabeled sequences, forward and backward tracking).]
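The forward-backward idea in this abstract can be illustrated with a toy sketch. It swaps in a much simpler setting than the paper's: instead of a Siamese correlation filter network and response maps, a "tracker" here is just a list of per-frame 2D displacements, and the consistency measure is the Euclidean distance between the starting point and the backtraced point. All names and values are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def track(position: Point, displacements: List[Point]) -> Point:
    """Apply a sequence of estimated per-frame displacements to a position."""
    x, y = position
    for dx, dy in displacements:
        x, y = x + dx, y + dy
    return (x, y)


def forward_backward_error(start: Point,
                           forward: List[Point],
                           backward: List[Point]) -> float:
    """Track forward, then backward, and measure how far the backtraced
    point lands from the starting point. Zero means perfect consistency."""
    end = track(start, forward)
    back = track(end, backward)
    return math.dist(start, back)


if __name__ == "__main__":
    forward = [(1.0, 0.0), (2.0, 1.0)]
    consistent_backward = [(-2.0, -1.0), (-1.0, 0.0)]
    drifting_backward = [(-2.0, -1.0), (-1.0, 1.0)]
    print(forward_backward_error((0.0, 0.0), forward, consistent_backward))
    print(forward_backward_error((0.0, 0.0), forward, drifting_backward))
```

A consistency error of this kind needs no ground-truth boxes beyond the initial position, which is why it can serve as a training signal on unlabeled video.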
