Unmanned Aerial Vehicles (UAVs) can pose a major risk to aviation safety, due to both negligent and malicious use. For this reason, the automated detection and tracking of UAVs is a fundamental task in aerial security systems. Common technologies for UAV detection include visible-band and thermal infrared imaging, radio frequency and radar. Recent advances in deep neural networks (DNNs) for image-based object detection open the possibility of using visual information for this detection and tracking task. Furthermore, these detection architectures can serve as backbones for visual tracking systems, thereby enabling persistent tracking of UAV incursions. To date, no comprehensive performance benchmark exists that applies DNNs to visible-band imagery for UAV detection and tracking. To this end, three datasets with varied environmental conditions for UAV detection and tracking, comprising a total of 241 videos (331,486 images), are assessed using four detection architectures and three tracking frameworks. The best-performing detector architecture obtains an mAP of 98.6% and the best-performing tracking framework obtains a MOTA of 96.3%. Cross-modality evaluation is carried out between the visible and infrared spectra, achieving a maximum of 82.8% mAP on visible images when training in the infrared modality. These results provide the first public multi-approach benchmark for state-of-the-art deep learning-based methods and give insight into which detection and tracking architectures are effective in the UAV domain.
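As a rough illustration of the tracking-by-detection idea referenced above (a detector supplies per-frame boxes, which a tracker then links across frames), the following Python sketch shows a greedy IoU-based association step. The function names and threshold are assumptions for illustration, not the benchmark's actual tracking frameworks.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedy frame-to-frame matching: each existing track keeps the
    unclaimed detection with the highest IoU above the threshold."""
    matches, used = [], set()
    for track_id, track_box in tracks.items():
        best_idx, best_iou = None, iou_thresh
        for det_idx, det_box in enumerate(detections):
            if det_idx in used:
                continue
            score = iou(track_box, det_box)
            if score > best_iou:
                best_idx, best_iou = det_idx, score
        if best_idx is not None:
            used.add(best_idx)
            matches.append((track_id, best_idx))
    return matches
```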
Automatic detection of threat items is an emerging area of application in X-ray security imagery. Although modern X-ray security scanners can provide two or more views, the integration of object detectors across these views has not been widely explored with rigour. Therefore, we investigate the application of geometric constraints, exploiting the epipolar nature of multi-view imagery, to improve object detection performance. Furthermore, we assume the images come from uncalibrated views, so a method is proposed to estimate the fundamental matrix from ground-truth bounding-box centroids in multi-view object labels. In addition, each detection is assigned a confidence probability based on where its distance to the epipolar line falls within the distribution of such distances. This probability is used as a confidence weight when merging duplicate predictions via non-maximum suppression. Using a standard object detector (YOLOv3), our technique increases the average precision of detection by 2.8% on a dataset composed of firearms, laptops, knives and cameras. These results indicate that integrating images from different views significantly improves the detection performance for threat items in cluttered X-ray security images.
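To make the geometric constraint concrete, the sketch below computes the point-to-epipolar-line distance that underlies the confidence weighting described above. The Gaussian-style scoring and all names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def epipolar_distance(F, pt_src, pt_tgt):
    """Distance from a target-view point to the epipolar line induced by a
    source-view point: l = F @ x_src, d = |l . x_tgt| / sqrt(a^2 + b^2).
    F could be estimated from >= 8 matched centroid pairs, e.g. with
    cv2.findFundamentalMat(src_pts, tgt_pts, cv2.FM_8POINT)."""
    x_src = np.array([pt_src[0], pt_src[1], 1.0])
    x_tgt = np.array([pt_tgt[0], pt_tgt[1], 1.0])
    line = F @ x_src  # epipolar line [a, b, c] in the target view
    return abs(line @ x_tgt) / np.hypot(line[0], line[1])

def epipolar_confidence(dist, mean, std):
    """Hypothetical confidence weight: score a detection by how typical its
    epipolar distance is under a Gaussian fit to true correspondences."""
    return float(np.exp(-0.5 * ((dist - mean) / std) ** 2))
```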
Object detection has been thoroughly investigated during the last decade using deep neural networks. However, the inclusion of additional information from multiple concurrent views of the same scene has not received much attention. In scenarios where objects may appear in obscure poses from certain viewpoints, the use of differing simultaneous views can improve object detection. Therefore, we propose a multi-view fusion network to enrich the backbone features of standard object detection architectures across multiple source and target viewpoints. Our method consists of a transformer decoder for the target view that combines the feature maps of the remaining source views. In this way, the feature representation of the target view can aggregate feature information from the source views through attention. Our architecture is detector-agnostic, meaning it can be applied across any existing detection backbone. We evaluate performance using YOLOX, Deformable DETR and Swin Transformer baseline detectors, comparing standard single-view performance against the addition of our multi-view transformer architecture. Our method achieves a 3% increase in COCO AP on a four-view X-ray security dataset and a smaller 0.7% increase on a seven-view pedestrian dataset. We demonstrate that integrating different views using attention-based networks improves detection performance on multi-view datasets.
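A minimal PyTorch sketch of the attention-based fusion described above, assuming a single transformer decoder layer in which flattened target-view feature tokens cross-attend to concatenated source-view tokens; the layer sizes and names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    """Sketch: target-view backbone features attend to source-view
    features via a transformer decoder, then reshape back to a map."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)

    def forward(self, target_feat, source_feats):
        # target_feat: (B, C, H, W); source_feats: list of (B, C, H, W)
        b, c, h, w = target_feat.shape
        tgt = target_feat.flatten(2).transpose(1, 2)       # (B, HW, C)
        mem = torch.cat([f.flatten(2).transpose(1, 2)
                         for f in source_feats], dim=1)    # (B, N*HW, C)
        fused = self.decoder(tgt, mem)                     # cross-attention
        return fused.transpose(1, 2).reshape(b, c, h, w)
```

Because the fusion operates on backbone feature maps and returns a tensor of the same shape, it can be dropped between the backbone and the detection head of any of the baseline detectors, which is what makes the approach detector-agnostic.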