Before their mass deployment, autonomous vehicles need to prove in complex scenarios that they can reach human driving proficiency while guaranteeing higher safety levels. One of the most important human traits for negotiating traffic is the ability to predict the future behavior of surrounding vehicles as a basis for agile and safe navigation. This capability is particularly challenging for an autonomous system in highly interactive driving situations, such as intersections or roundabouts. In this paper, a set of techniques is presented to bring a computationally expensive state-of-the-art motion prediction algorithm to real-time execution, with the goal of meeting the 5 Hz execution frequency of a standard motion-planning algorithm, which is the primary consumer of motion predictions. This is achieved by applying novel and existing parallelization algorithms that take advantage of graphics processing units (GPUs) through the compute unified device architecture (CUDA) programming model, producing an average 5× speedup over a plain C++ implementation in the cases studied. The optimizations are then evaluated on public datasets and in a real vehicle on a test track.