2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
DOI: 10.1109/iccvw.2019.00034

Great Ape Detection in Challenging Jungle Camera Trap Footage via Attention-Based Spatial and Temporal Feature Blending

Abstract: We propose the first multi-frame video object detection framework trained to detect great apes. It is applicable to challenging camera trap footage in complex jungle environments and extends a traditional feature pyramid architecture by adding self-attention driven feature blending in both the spatial as well as the temporal domain. We demonstrate that this extension can detect distinctive species appearance and motion signatures despite significant partial occlusion. We evaluate the framework using 500 camera…
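
To make the abstract's description more concrete, below is a minimal PyTorch sketch of self-attention driven temporal feature blending over a single feature pyramid level. It is an illustrative assumption about the general mechanism, not the authors' published module; the tensor shapes, the 1x1 convolution projections, and the residual blend are choices made here for clarity only.

```python
import torch
import torch.nn as nn


class TemporalBlend(nn.Module):
    """Illustrative self-attention blending of a frame's FPN feature map with
    the same pyramid level from neighbouring frames (a sketch, not the
    authors' exact architecture)."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) -- index 0 is the frame being detected,
        # the remaining T-1 entries are temporally adjacent frames.
        T, C, H, W = feats.shape
        q = self.query(feats[:1]).flatten(2).squeeze(0)         # (C, H*W)
        k = self.key(feats).flatten(2)                          # (T, C, H*W)
        v = self.value(feats).flatten(2)                        # (T, C, H*W)

        # similarity of every current-frame location to every location in
        # every frame, then a softmax over all candidate locations
        scores = torch.einsum('cn,tcm->ntm', q, k) / C ** 0.5   # (H*W, T, H*W)
        attn = scores.reshape(H * W, -1).softmax(dim=-1)        # (H*W, T*H*W)

        v_flat = v.permute(0, 2, 1).reshape(-1, C)              # (T*H*W, C)
        blended = (attn @ v_flat).t().reshape(C, H, W)          # (C, H, W)
        return feats[0] + blended                               # residual blend


# usage sketch: blend a 256-channel pyramid level over a 5-frame window
blend = TemporalBlend(channels=256)
window = torch.randn(5, 256, 32, 32)
out = blend(window)  # (256, 32, 32)
```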

Cited by 16 publications (19 citation statements). References 29 publications.

“…YOLO is a faster, more efficient object detector, which may be more suited to video processing, while Faster R-CNN generally achieves higher accuracies but is slower. RetinaNet was chosen as it achieves a good balance between the computational efficiency of YOLO and the accuracy of Faster R-CNN, which made it an appropriate choice for the difficult task of camera trap image processing (Yang et al., 2019). In this study, we have only demonstrated location invariance using RetinaNet.…”
Section: Discussion
confidence: 99%
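
As a concrete illustration of the detector trade-off discussed above, a COCO-pretrained RetinaNet from torchvision can be run on a camera trap frame in a few lines. This is a generic usage sketch, not the cited study's pipeline; the file name, confidence threshold, and pretrained weights are assumptions.

```python
import torch
from torchvision.models.detection import retinanet_resnet50_fpn
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# COCO-pretrained RetinaNet with a ResNet-50 FPN backbone (illustrative only;
# a camera-trap deployment would fine-tune on species-specific annotations).
# Assumes torchvision >= 0.13 for the weights API.
model = retinanet_resnet50_fpn(weights="DEFAULT").eval()

# hypothetical input frame; detection models expect float tensors in [0, 1]
frame = convert_image_dtype(read_image("camera_trap_frame.jpg"), torch.float)
with torch.no_grad():
    detections = model([frame])[0]

keep = detections["scores"] > 0.5          # assumed confidence threshold
boxes = detections["boxes"][keep]
labels = detections["labels"][keep]
print(boxes, labels)
```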
“…The archive footage contains around 20K videos adding up to around 600 hours. We use a subset of 5219 videos, with 500 videos (totalling over 180K frames) manually annotated with per-frame great ape location bounding boxes, species and further categories (Yang et al. 2019; Sakib and Burghardt 2021). This labelled data is split into train, validation and test sets at a ratio of 80%, 5% and 15%, respectively.…”
Section: Methods
confidence: 99%
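
The 80% / 5% / 15% split mentioned in the quote works out to roughly 400, 25 and 75 of the 500 annotated videos. Below is a trivial sketch of how such a ratio split might be realised; the seed and per-video shuffling are assumptions, not the cited protocol.

```python
import random

# illustrative split of 500 annotated video IDs into 80% / 5% / 15%
video_ids = [f"video_{i:04d}" for i in range(500)]
random.Random(0).shuffle(video_ids)

n_train = int(0.80 * len(video_ids))   # 400 videos
n_val = int(0.05 * len(video_ids))     # 25 videos
trainset = video_ids[:n_train]
valset = video_ids[n_train:n_train + n_val]
testset = video_ids[n_train + n_val:]  # remaining 75 videos (15%)

print(len(trainset), len(valset), len(testset))  # 400 25 75
```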
“…Whilst crowd-sourcing annotations can help, low labelling rates relative to archive sizes remain the norm in the field. For great apes in particular, several recent works have attempted to address some of the above-mentioned challenges (Yang et al. 2019; Schofield et al. 2019; Sakib and Burghardt 2021; Bain et al. 2021). However, these works still either only pretrain on datasets from other domains or rely on relatively small datasets for supervised training due to the complexities associated with obtaining annotations.…”
Section: Introduction
confidence: 99%
“…The method performs detection on un-annotated video frames and uses adjacent frames to locate the object in the current frame. Yang et al. add a Temporal Context Module and a Spatial Context Module to image object detectors in order to detect wild great apes [116]. Literature on the UG2+ (UAV, Glider, Ground) challenge concludes that the methods with the best video detection performance use a spatiotemporal context approach [117].…”
Section: Mixed-stage Video Object Detection
confidence: 99%
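
The Spatial Context Module referenced in [116] can be pictured as self-attention over the locations of a single frame's feature map, complementing the temporal blending sketched after the abstract. The module below is an assumed, simplified illustration of that idea rather than a reproduction of the cited architecture.

```python
import torch
import torch.nn as nn


class SpatialContext(nn.Module):
    """Sketch of per-frame spatial self-attention: every location of a
    feature map attends over all other locations of the same frame."""

    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Conv2d(channels, 3 * channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)        # each (B, C, H*W)
        attn = torch.softmax(q.transpose(1, 2) @ k / C ** 0.5, dim=-1)  # (B, HW, HW)
        ctx = (attn @ v.transpose(1, 2)).transpose(1, 2)        # (B, C, HW)
        return x + ctx.reshape(B, C, H, W)                      # residual add


# usage sketch on one 256-channel pyramid level
sc = SpatialContext(channels=256)
feat = torch.randn(1, 256, 32, 32)
print(sc(feat).shape)  # torch.Size([1, 256, 32, 32])
```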