Research on autonomous cars, early intensified in the 1990s, is becoming one of the main research paths in automotive industry. Recent works use Rapidly-exploring Random Trees to explore the state space along a given reference path, and to compute the minimum time collision-free path in real time. Those methods do not require good approximations of the reference path, they are able to cope with discontinuous routes, they are capable of navigating in realistic traffic scenarios, and they derive their power from an extensive computational effort directed to improve the quality of the trajectory from step to step. In this paper, we focus on re-engineering an existing state-of-the-art sequential algorithm to obtain a CUDA-based GPGPU (General Purpose Graphics Processing Units) implementation. To do that, we show how to partition the original algorithm among several working threads running on the GPU, how to propagate information among threads, and how to synchronize those threads. We also give detailed evidence on how to organize memory transfers between the CPU and the GPU (and among different CUDA kernels) such that planning times are optimized and the available memory is not exceeded while storing massive amounts of fuse data. To sum up, in our application the GPU is used for all main operations, the entire application is developed in the CUDA language, and specific attention is paid to concurrency, synchronization, and data communication. We run experiments on several real scenarios, comparing the GPU implementation with the CPU one in terms of the quality of the generated paths and in terms of computation (wall-clock) times. The results of our experiments show that embedded GPUs can be used as an enabler for real-time applications of computationally expensive planning approaches.