2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.789

YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video

Abstract: We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the COCO [32] label set. All video segments were human-annotated with high-precisi…
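To make the annotation structure concrete, here is a minimal sketch of parsing bounding-box rows in the CSV layout commonly distributed with the YT-BB detection annotations. The exact column order and the sample rows below are assumptions for illustration, not taken from the paper; coordinates are assumed normalized to [0, 1], with placeholder values when the object is absent.

```python
import csv
import io

# Hypothetical sample rows (an assumption, not from the paper), in the layout:
# youtube_id, timestamp_ms, class_id, class_name, object_id,
# object_presence, xmin, xmax, ymin, ymax
SAMPLE = """\
AAAAAAAAAAA,0,0,person,0,present,0.10,0.50,0.20,0.80
AAAAAAAAAAA,1000,0,person,0,absent,-1,-1,-1,-1
"""

def parse_ytbb_rows(text):
    """Parse annotation rows, keeping only frames where the object is present."""
    boxes = []
    for row in csv.reader(io.StringIO(text)):
        (vid, ts_ms, cls_id, cls_name, obj_id, presence,
         xmin, xmax, ymin, ymax) = row
        if presence != "present":
            continue  # object out of frame; coordinate fields are placeholders
        boxes.append({
            "video": vid,
            "time_ms": int(ts_ms),
            "class": cls_name,
            "object_id": int(obj_id),
            # normalized (xmin, ymin, xmax, ymax) corner coordinates
            "box": (float(xmin), float(ymin), float(xmax), float(ymax)),
        })
    return boxes

boxes = parse_ytbb_rows(SAMPLE)
```

Filtering on the presence flag matters because annotated segments include timestamps where the tracked object leaves the frame, and those rows carry no usable box.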

Cited by 556 publications (295 citation statements)
References 45 publications
“…We train a version of our tracker with the ResNet-50 backbone using only the ImageNet VID [31], TrackingNet [25] and COCO [22] datasets. We compare this version, denoted DiMP-50-data, with the state-of-the-art Siamese tracker, SiamRPN++ [20], trained using ImageNet VID, YouTube-BB [29], COCO and ImageNet DET. [Figure S3: success plots on the NFS (a), OTB-100 (b), and UAV123 (c) datasets.]…”
Section: S6 Impact Of Training Data
mentioning confidence: 99%
“…The OxUvA [33] long-term dataset consists of 366 object tracks in 337 videos, which are carefully selected from the YTBB [27] dataset and sparsely labeled at a frequency of 1 Hz. Compared with popular short-term tracking datasets (such as OTB2015), this dataset has many long-term videos (each video lasts for 2.4 minutes on average) and includes severe out-of-view and full-occlusion challenges.…”
Section: Results On OxUvA
mentioning confidence: 99%
“…Finally, we utilize more unlabeled videos for network training. These additional raw videos are from the OxUvA benchmark [48] (337 videos in total), which is a subset of YouTube-BB [41]. In Fig.…”
Section: Ablation Study And Analysis
mentioning confidence: 99%