A reliable sense-and-avoid system is critical to enabling safe autonomous operation of unmanned aircraft. Existing sense-and-avoid methods often require specialized sensors that are too large or power-intensive for use on small unmanned vehicles. This paper presents a method to estimate object distances from visual image sequences, allowing low-cost, on-board monocular cameras to serve as simple collision avoidance sensors. We present a deep recurrent convolutional neural network and a training method for generating depth maps from video sequences. Our network is trained using simulated camera and depth data generated with Microsoft's AirSim simulator. Empirically, we show that our model achieves superior performance compared to models generated using prior methods. We further demonstrate that the method can be used for sense-and-avoid of obstacles in simulation.

I. Introduction

Effective sense-and-avoid systems are necessary to safely integrate unmanned aircraft into the airspace [1]. Many systems require specialized sensors and extensive computational resources, which makes it difficult to adhere to aircraft size, weight, and power (SWaP) constraints [2]. Embedded digital cameras, such as those commonly installed in cell phones, are common low-SWaP sensors that can be easily accommodated on board most small unmanned aircraft.

Camera images cannot be used directly for sense-and-avoid because they do not provide the three-dimensional locations of potential obstacles. In this paper, we present a method to estimate three-dimensional locations using image sequences from simple monocular cameras. Our method generates a relative depth map over each pixel in the camera field of view (FoV), measured along the direction normal to the image plane. The resulting depth maps can then be used in a variety of applications, such as Simultaneous Localization and Mapping (SLAM) or sense-and-avoid. Because the method does not require specialized sensors, it can be used on SWaP-constrained vehicles where other systems are infeasible.

The proposed method uses a deep neural network to map visual image sequences to corresponding relative depth maps. To account for the correlation between sequential input frames, we propose a recurrent convolutional neural network (R-CNN) architecture [3]. We present this general architecture and recommend an auto-encoder design based on convolutional Gated Recurrent Units (C-GRUs); a minimal sketch of such a design appears at the end of this section. In addition, we present a method to effectively train the network over image sequences using stochastic mini-batches.

We demonstrate the effectiveness of the depth extraction approach in Microsoft's AirSim simulator [4]. Using AirSim, we generate matched pairs of images from a simulated on-board camera and the corresponding depth maps of the scene in the field of view (see the second sketch below). We provide qualitative examples of the depth maps generated by our method and quantitative evaluations using conventional metrics from the field of computer vision. Our method outperforms three previously proposed deep learning-based methods. We also show that the accuracy of th...
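The paper describes the network only at a high level, but a C-GRU auto-encoder of the kind proposed can be illustrated as follows. This is a minimal sketch assuming a PyTorch implementation; the layer counts, channel widths, and kernel sizes below are placeholders, not the authors' actual configuration.

```python
# Sketch of a recurrent convolutional depth network: a convolutional
# encoder, a convolutional GRU (C-GRU) bottleneck that carries state
# across frames, and a convolutional decoder emitting one depth map
# per frame. All dimensions are illustrative.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU cell whose gates are computed with 2-D convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # update (z) and reset (r) gates from [input, hidden]
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        # candidate hidden state from [input, reset-gated hidden]
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        if h is None:  # zero hidden state at the start of a sequence
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

class DepthRCNN(nn.Module):
    """Encoder -> C-GRU bottleneck -> decoder; input H, W divisible by 4."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.gru = ConvGRUCell(64, 64)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))

    def forward(self, frames):           # frames: (batch, time, 3, H, W)
        h, depths = None, []
        for t in range(frames.size(1)):  # carry hidden state across frames
            h = self.gru(self.enc(frames[:, t]), h)
            depths.append(self.dec(h))
        return torch.stack(depths, 1)    # (batch, time, 1, H, W)
```

For example, `DepthRCNN()(torch.rand(1, 8, 3, 64, 64))` returns eight relative depth maps, one per input frame. Carrying the hidden state across frames is what lets the network exploit inter-frame correlation; a purely feed-forward encoder-decoder would treat each frame independently.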
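The matched camera/depth training pairs can be collected through AirSim's Python client. The snippet below is a sketch assuming a recent AirSim release; the camera name, image channel layout, and the depth image type name (`DepthPlanar` in recent releases, `DepthPlanner` in older ones) vary across versions. Note that `DepthPlanar` measures depth normal to the image plane, matching the paper's depth definition.

```python
# Sketch: collecting one matched RGB/depth training pair from AirSim.
# Assumes the open-source `airsim` Python package and a running simulator;
# camera name "0" and image encodings may differ across AirSim versions.
import airsim
import numpy as np

client = airsim.MultirotorClient()
client.confirmConnection()

responses = client.simGetImages([
    # Uncompressed RGB scene image, returned as raw bytes.
    airsim.ImageRequest("0", airsim.ImageType.Scene, False, False),
    # Per-pixel depth normal to the image plane, returned as floats
    # (meters). Older releases name this type ImageType.DepthPlanner.
    airsim.ImageRequest("0", airsim.ImageType.DepthPlanar, True),
])
scene, depth = responses

rgb = np.frombuffer(scene.image_data_uint8, dtype=np.uint8)
rgb = rgb.reshape(scene.height, scene.width, -1)   # BGR(A) channel order
depth_map = np.asarray(depth.image_data_float, dtype=np.float32)
depth_map = depth_map.reshape(depth.height, depth.width)

np.savez("pair_0000.npz", image=rgb, depth=depth_map)  # one training pair
```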