Person detection from vehicles has made rapid progress recently with the advent of multiple high-quality datasets of urban and highway driving, yet no large-scale benchmark is available for the same problem in off-road or agricultural environments. Here we present the National Robotics Engineering Center (NREC) Agricultural Person-Detection Dataset to spur research in these environments. It consists of labeled stereo video of people in orange and apple orchards taken from two perception platforms (a tractor and a pickup truck), along with vehicle position data from Real-Time Kinematic (RTK) GPS. We define a benchmark on part of the dataset that combines a total of 76k labeled person images and 19k sampled person-free images. The dataset highlights several key challenges of the domain, including varying environments, substantial occlusion by vegetation, people in motion and in nonstandard poses, and people seen from a variety of distances; metadata are included to allow targeted evaluation of each of these effects. Finally, we present baseline detection performance results for three leading approaches from urban pedestrian detection and for our own convolutional neural network approach, which benefits from the incorporation of additional image context. We show that the success of existing approaches on urban data does not transfer directly to this domain.
Accurate layout estimation is crucial for planning and navigation in robotics applications such as self-driving. In this paper, we introduce the Stereo Bird's Eye View Network (SBEVNet), a novel supervised end-to-end framework for estimating the bird's eye view (BEV) layout from a pair of stereo images. Although our network reuses some of the building blocks from state-of-the-art deep learning networks for disparity estimation, we show that explicit depth estimation is neither sufficient nor necessary. Instead, learning a good internal BEV feature representation is effective for layout estimation. Specifically, we first generate a disparity feature volume using the features of the stereo images and then project it to the bird's eye view coordinates, which gives us coarse-grained information about the scene structure. We also apply inverse perspective mapping (IPM) to map the input images and their features to the bird's eye view, which gives us fine-grained texture information. Concatenating the IPM features with the projected feature volume creates a rich BEV representation that is useful for spatial reasoning. We use this representation to estimate the BEV semantic map. Additionally, we show that using the IPM features as a supervisory signal for the stereo features further improves performance. We demonstrate our approach on two datasets: the KITTI dataset (Geiger et al., 2013) and a synthetically generated dataset from the CARLA simulator (Dosovitskiy et al., 2017). On both datasets, our approach establishes state-of-the-art performance relative to baseline techniques.
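To make the described pipeline concrete, below is a minimal PyTorch sketch of the two-branch idea: a disparity feature volume is collapsed and resampled onto a BEV grid (coarse structure) and concatenated with IPM-warped image features (fine texture) before a semantic-map head. All module sizes, the mean-pooling stand-in for the geometric projection, and the precomputed `ipm_grid` argument are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of the SBEVNet idea; shapes and modules are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SBEVNetSketch(nn.Module):
    def __init__(self, feat_ch=32, max_disp=48, bev_size=64, n_classes=5):
        super().__init__()
        # Shared 2D encoder for the left/right images (assumed design).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.max_disp = max_disp // 4  # disparities at 1/4 resolution
        # 3D convolutions over the disparity feature volume, as in stereo nets.
        self.cost_net = nn.Sequential(
            nn.Conv3d(2 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        self.bev_size = bev_size
        # Semantic head over concatenated projected-volume + IPM features.
        self.bev_head = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, n_classes, 1),
        )

    def build_volume(self, fl, fr):
        # Concatenation-based disparity feature volume (PSMNet-style).
        b, c, h, w = fl.shape
        vol = fl.new_zeros(b, 2 * c, self.max_disp, h, w)
        for d in range(self.max_disp):
            vol[:, :c, d, :, d:] = fl[:, :, :, d:]
            vol[:, c:, d, :, d:] = fr[:, :, :, : w - d]
        return vol

    def forward(self, left, right, ipm_grid):
        fl, fr = self.encoder(left), self.encoder(right)
        vol = self.cost_net(self.build_volume(fl, fr))   # B, C, D, H, W
        # Coarse branch: collapse the image-height axis, then resample the
        # (disparity, width) slice onto the BEV grid. The real projection
        # uses camera geometry; interpolation stands in for it here.
        coarse = vol.mean(dim=3)                          # B, C, D, W
        coarse = F.interpolate(coarse, size=(self.bev_size, self.bev_size),
                               mode='bilinear', align_corners=False)
        # Fine branch: warp image features to BEV via an IPM sampling grid
        # (assumed precomputed from camera calibration, values in [-1, 1]).
        fine = F.grid_sample(fl, ipm_grid, align_corners=False)
        return self.bev_head(torch.cat([coarse, fine], dim=1))

left = torch.randn(1, 3, 128, 256)
right = torch.randn(1, 3, 128, 256)
grid = torch.rand(1, 64, 64, 2) * 2 - 1   # placeholder IPM grid
print(SBEVNetSketch()(left, right, grid).shape)  # torch.Size([1, 5, 64, 64])
```

In this sketch the IPM warp is applied only to the left-image features; supervising the stereo features with the IPM branch, as the paper reports, would add an auxiliary loss between the two representations.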