Predicting the future behaviour of neighbouring agents is crucial for autonomous driving. The task is challenging largely because each agent's intent is diverse and unobservable, and because the many possible interactions between agents complicate it further. The authors propose a multi-future Transformer framework that implicitly models the multi-modal joint distribution by capturing the diverse interaction modes of the scene. To this end, they construct a parallel interaction module in which each interaction block learns the joint agent–agent and agent–map interactions for a possible future evolution. The model can estimate likelihoods from the perspective of both the joint distribution of the scene and the marginal distribution of each agent. Combined with a proposed scene-level winner-take-all loss strategy that complements the model architecture, a single model achieves the best performance on both the target agent prediction and scene prediction tasks. To better utilise the scene context, comprehensive control experiments were conducted; they highlight the importance of fine-grained scene representation with content-adaptive aggregation and late fusion of semantic attributes. Evaluated on the popular Argoverse forecasting dataset, the method outperformed previous methods while maintaining low model complexity.
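The abstract names a scene-level winner-take-all loss without giving its form. As a rough illustration only, the sketch below assumes one common reading: among K joint futures, the mode with the lowest displacement error averaged over all agents is the single winner, that mode's error is the regression term, and a cross-entropy term pushes the scene score of the winner up. The function name, the input layout, and the equal weighting of the two terms are assumptions made for this example, not details taken from the paper.

```python
import math


def scene_level_wta_loss(pred, gt, scene_logits):
    """Hypothetical scene-level winner-take-all loss (illustrative sketch).

    pred:         K modes x N agents x T steps x (x, y), as nested lists
    gt:           N agents x T steps x (x, y)
    scene_logits: K unnormalised scores, one per joint mode
    Returns (loss, winner_index).
    """
    scene_err = []
    for mode in pred:
        agent_ades = []
        for agent_pred, agent_gt in zip(mode, gt):
            # Average displacement error of one agent under this mode.
            dists = [math.dist(p, g) for p, g in zip(agent_pred, agent_gt)]
            agent_ades.append(sum(dists) / len(dists))
        # Scene error: mean of the per-agent errors for this joint mode.
        scene_err.append(sum(agent_ades) / len(agent_ades))

    # One winner for the whole scene, not one per agent.
    winner = min(range(len(scene_err)), key=scene_err.__getitem__)

    # Cross-entropy of the scene scores against the winning mode,
    # computed with a numerically stable log-sum-exp.
    m = max(scene_logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in scene_logits))
    cls_loss = log_z - scene_logits[winner]

    # Assumption: regression and classification terms are weighted equally.
    return scene_err[winner] + cls_loss, winner
```

Under this reading, a mode that fits one agent perfectly but another agent badly can lose to a mode that fits every agent moderately well, which is the difference from picking a separate winner for each agent.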