Training spiking recurrent neural networks on neuronal recordings or behavioral tasks has become a popular way to study computations performed by the nervous system. As the size and complexity of neural recordings increase, there is a need for efficient algorithms that can train models in a short period of time using minimal resources. We present optimized CPU and GPU implementations of the recursive least-squares algorithm in spiking neural networks. The GPU implementation can train networks of one million neurons, with 100 million plastic synapses and a billion static synapses, about 1,000 times faster than an unoptimized reference CPU implementation. We demonstrate the code's utility by training a network, in less than an hour, to reproduce the activity of > 66, 000 recorded neurons of a mouse performing a decision-making task. The fast implementation enables a more interactive in-silico study of the dynamics and connectivity underlying multi-area computations. It also admits the possibility to train models as in-vivo experiments are being conducted, thus closing the loop between modeling and experiments.