In this work, we propose the world's first closed-loop ML-based planning benchmark for autonomous driving. While there is a growing body of ML-based motion planners, the lack of established datasets and metrics has limited the progress in this area. Existing benchmarks for autonomous vehicle motion prediction have focused on short-term motion forecasting, rather than long-term planning. This has led previous works to use open-loop evaluation with L2-based metrics, which are not suitable for fairly evaluating long-term planning. Our benchmark overcomes these limitations by introducing a largescale driving dataset, lightweight closed-loop simulator, and motion-planning-specific metrics. We provide a highquality dataset with 1500h of human driving data from 4 cities across the US and Asia with widely varying traffic patterns (Boston, Pittsburgh, Las Vegas and Singapore). We will provide a closed-loop simulation framework with reactive agents and provide a large set of both general and scenario-specific planning metrics. We plan to release the dataset at NeurIPS 2021 and organize benchmark challenges starting in early 2022.
Human action recognition is a constantly evolving field that is driven by numerous applications. In recent years, significant progress has been made in this area due to the development of advanced representation learning techniques. Despite this progress, human action recognition still poses significant challenges, particularly due to the unpredictable variations in the visual appearance of an image sequence. To address these challenges, we propose the fine-tuned temporal dense sampling with 1D convolutional neural network (FTDS-1DConvNet). Our method involves the use of temporal segmentation and temporal dense sampling, which help to capture the most important features of a human action video. First, the human action video is partitioned into segments through temporal segmentation. Each segment is then processed through a fine-tuned Inception-ResNet-V2 model, where max pooling is performed along the temporal axis to encode the most significant features as a fixed-length representation. This representation is then fed into a 1DConvNet for further representation learning and classification. The experiments on UCF101 and HMDB51 demonstrate that the proposed FTDS-1DConvNet outperforms the state-of-the-art methods, with a classification accuracy of 88.43% on the UCF101 dataset and 56.23% on the HMDB51 dataset.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.