The uncertainty of a driver's state, the variability of the traffic environment, and the complexity of road conditions have made driving behavior a critical factor affecting traffic safety. Accurate prediction of driving behavior is therefore crucial for ensuring safe driving. In this research, an efficient framework, the distilled routing transformer (DRTR), is proposed for driving behavior prediction using multi-modal data, i.e., front-view video frames and vehicle signals. First, a cross-modal attention distiller is introduced, which distills the cross-modal attention knowledge of a fusion-encoder transformer to guide the training of the DRTR and learn deep interactions between different modalities.
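As a minimal sketch of this kind of attention distillation (assuming a PyTorch implementation; the function and tensor names below are hypothetical, not the paper's), the objective can be written as a KL divergence between the teacher's and student's cross-modal attention maps:

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn: torch.Tensor,
                                teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence between teacher and student cross-modal attention maps.

    Both tensors have shape (batch, heads, query_len, key_len), and each row
    along the last dimension is a softmax-normalized attention distribution.
    """
    # Flatten so every attention row is one categorical distribution.
    s = student_attn.flatten(end_dim=-2)
    t = teacher_attn.flatten(end_dim=-2)
    # KL(teacher || student), averaged over all attention rows.
    return F.kl_div(s.clamp_min(1e-8).log(), t, reduction="batchmean")
```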
Second, since multi-modal learning usually requires information ranging from the macro view to the micro view, a self-attention (SA)-routing module is custom-designed for the SA layers in DRTR to dynamically schedule global and local attention for each input instance.
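One plausible reading of such per-instance scheduling is a learned soft gate over a global and a local attention branch. The sketch below illustrates that idea only; the gating design is an assumption, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class SARouting(nn.Module):
    """Per-instance soft routing between a global and a local attention branch.

    `global_attn` and `local_attn` are any modules mapping
    (batch, seq_len, dim) -> (batch, seq_len, dim); the router weights their
    outputs from a pooled summary of the input sequence.
    """
    def __init__(self, dim: int, global_attn: nn.Module, local_attn: nn.Module):
        super().__init__()
        self.global_attn = global_attn
        self.local_attn = local_attn
        self.router = nn.Linear(dim, 2)  # one logit per attention branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route from a sequence-level summary so the choice is per instance.
        weights = self.router(x.mean(dim=1)).softmax(dim=-1)  # (batch, 2)
        g = self.global_attn(x)
        l = self.local_attn(x)
        return weights[:, 0, None, None] * g + weights[:, 1, None, None] * l
```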
Finally, a Mogrifier long short-term memory (Mogrifier LSTM) network is employed in DRTR to predict driving behaviors.
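The Mogrifier LSTM lets the input and the previous hidden state gate each other for a few alternating rounds before the standard LSTM update. A compact sketch of those rounds (a generic PyTorch rendering of the published Mogrifier cell, not the paper's own code):

```python
import torch
import torch.nn as nn

class MogrifierLSTMCell(nn.Module):
    """LSTM cell whose input and previous hidden state modulate each other
    for a few alternating 'mogrifier' rounds before the LSTM update."""
    def __init__(self, input_dim: int, hidden_dim: int, rounds: int = 5):
        super().__init__()
        self.rounds = rounds
        self.q = nn.ModuleList(nn.Linear(hidden_dim, input_dim)
                               for _ in range((rounds + 1) // 2))
        self.r = nn.ModuleList(nn.Linear(input_dim, hidden_dim)
                               for _ in range(rounds // 2))
        self.cell = nn.LSTMCell(input_dim, hidden_dim)

    def mogrify(self, x, h):
        for i in range(self.rounds):
            if i % 2 == 0:   # odd rounds (1-based): h rescales x
                x = 2 * torch.sigmoid(self.q[i // 2](h)) * x
            else:            # even rounds: x rescales h
                h = 2 * torch.sigmoid(self.r[i // 2](x)) * h
        return x, h

    def forward(self, x, state):
        h, c = state
        x, h = self.mogrify(x, h)
        return self.cell(x, (h, c))
```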
We applied our approach to real-world data collected by an instrumented vehicle during drives in both urban and freeway environments. The experimental results demonstrate that the DRTR can predict imminent driving behavior effectively while achieving faster inference than other state-of-the-art (SOTA) baselines.