Simultaneous trajectory prediction for multiple heterogeneous traffic participants is essential for the safe and efficient operation of connected automated vehicles in complex driving situations. Two main challenges for this task are handling the varying number of heterogeneous target agents and jointly considering the multiple factors that affect their future motions. Different kinds of agents have different motion patterns, and their behaviors are jointly affected by their individual dynamics, their interactions with surrounding agents, and the traffic infrastructure. A trajectory prediction method that handles these challenges will benefit the downstream decision-making and planning modules of autonomous vehicles. To meet these challenges, we propose a three-channel framework together with a novel Heterogeneous Edge-enhanced graph ATtention network (HEAT). Our framework is able to deal with the heterogeneity of the target agents and traffic participants involved. Specifically, agents' dynamics are extracted from their historical states using type-specific encoders. The inter-agent interactions are represented with a directed edge-featured heterogeneous graph and processed by the designed HEAT network to extract interaction features. In addition, the map features are shared across all agents through a selective gate mechanism. Finally, the trajectories of multiple agents are predicted simultaneously. Validations using both urban and highway driving datasets show that the proposed model can realize simultaneous trajectory predictions for multiple agents under complex traffic situations and achieve state-of-the-art performance with respect to prediction accuracy. The achieved final displacement error (FDE@3sec) is 0.66 meters under urban driving, demonstrating the feasibility and effectiveness of the proposed approach.
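For readers unfamiliar with the reported metric, the following is a minimal sketch of how a final displacement error is commonly computed, assuming it is the Euclidean distance between the predicted and ground-truth positions at the last step of the horizon, averaged over agents. The function name and the trajectory layout (a list of (x, y) points in meters per agent) are assumptions made for illustration and are not taken from the paper.

```python
import math


def final_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between final predicted and true positions.

    Hypothetical layout: each argument is a list with one trajectory per
    agent, and each trajectory is a list of (x, y) positions in meters.
    Only the last point of each trajectory (the end of the prediction
    horizon, e.g. 3 s) contributes to the metric.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need the same non-zero number of agents in both inputs")

    errors = []
    for pred_traj, true_traj in zip(predicted, ground_truth):
        pred_x, pred_y = pred_traj[-1]
        true_x, true_y = true_traj[-1]
        errors.append(math.hypot(pred_x - true_x, pred_y - true_y))

    # Average over agents so the metric does not depend on how many
    # agents appear in the scene.
    return sum(errors) / len(errors)
```

Because the average is taken over agents, the same function applies to scenes with any number of target agents, which matches the varying-agent-count setting described above.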
Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it requires comprehending fine-grained language compositions. However, existing solutions rely merely on the association between holistic language features and visual features, while neglecting the composite reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RVG-TREE: Recursive Grounding Tree, which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated from the grounding scores returned by the two sub-trees. RVG-TREE can be trained end-to-end using the Straight-Through Gumbel-Softmax estimator, which allows the gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning.
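The bottom-up accumulation described above can be illustrated with a small sketch. It assumes that each tree node carries one score per candidate image region and that a node's accumulated score is the element-wise sum of its own scores and those returned by its two sub-trees. The `Node` class, the summation rule, and the example scores are hypothetical and only illustrate the recursion; they are not the scoring functions of RVG-TREE.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A node of a binary parse tree (hypothetical structure)."""
    phrase: str
    local_scores: List[float]  # assumed: one score per candidate region
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def accumulate(node: Node) -> List[float]:
    """Return per-region scores accumulated bottom-up over the sub-tree."""
    total = list(node.local_scores)
    for child in (node.left, node.right):
        if child is not None:
            child_scores = accumulate(child)
            # Assumed combination rule: element-wise sum with the sub-tree.
            total = [t + c for t, c in zip(total, child_scores)]
    return total


def ground(root: Node) -> int:
    """Index of the region with the highest accumulated score."""
    scores = accumulate(root)
    return max(range(len(scores)), key=scores.__getitem__)


# Usage with made-up scores for three candidate regions.
dog = Node("the black dog", [1.0, 0.5, 0.0])
tree = Node("the tree", [0.0, 0.25, 0.5])
root = Node("on the left of", [0.5, 0.0, 0.0], left=dog, right=tree)
best_region = ground(root)
```

Each intermediate node exposes its own accumulated scores, which is one way to read the claim that the reasoning becomes more explainable: the contribution of every sub-expression can be inspected separately.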