We propose a multi-task learning framework for improving the performance of vision-based deep-learning approaches to driver distraction recognition. The most popular tool for this task is the convolutional neural network (CNN), which has proven to be strongly biased toward local features. This bias causes CNNs to neglect global structural information, which harms the robustness of distracted driver recognition. To address this problem, we generate a positive and a negative sample for each given input and construct a triplet of images (i.e., raw image, positive sample, and negative sample). The positive sample is generated by applying structure-aware illumination to the human body region of the input. The negative sample is generated by randomly shuffling the local regions of the input. The network is then trained on these triplets with a multi-task learning strategy that forces it to exploit global information through three tasks: (a) recognizing the raw input and the positive sample as the given ground truth; (b) recognizing the negative sample as an extra "meaningless" label; and (c) pulling the features of the raw input and the positive sample closer together while pushing the features of the raw input and the negative sample apart. In this way, the model learns to neglect background information and pay more attention to the global structural information of the scene. The proposed approach reaches state-of-the-art performance on the AUC Distracted Driver Dataset and outperforms state-of-the-art studies on the Drive and Act Dataset. With raw images as input, we achieve an accuracy of 96.0% on the AUC Distracted Driver Dataset and 66.8% on the Drive and Act Dataset. Our approach introduces no extra overhead during the testing (i.e., deployment) procedure, which is helpful for real-life applications.
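The negative-sample construction and the pull/push contrastive term described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the grid size, the margin value, and the helper names are our assumptions.

```python
import numpy as np

def shuffle_local_regions(image, grid=4, rng=None):
    """Build a negative sample by randomly permuting a grid of local
    patches, destroying global structure while keeping local statistics.
    (Illustrative helper; the grid size is an assumption, not from the paper.)"""
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw].copy()
               for r in range(grid) for c in range(grid)]
    order = rng.permutation(len(patches))
    out = image.copy()
    for idx, src in enumerate(order):
        r, c = divmod(idx, grid)
        out[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = patches[src]
    return out

def contrastive_pull_push(f_raw, f_pos, f_neg, margin=1.0):
    """Triplet-style term: pull raw/positive features together and push
    raw/negative features apart (margin is an assumed hyperparameter)."""
    d_pos = np.linalg.norm(f_raw - f_pos)
    d_neg = np.linalg.norm(f_raw - f_neg)
    return max(0.0, d_pos - d_neg + margin)
```

During training, this term would be added to the two classification losses (ground-truth label for raw/positive inputs, "meaningless" label for the shuffled negative).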
Moreover, better accuracy can be achieved by fusing the predictions obtained from the raw input and the positive sample, respectively. With this fusion, we achieve an accuracy of 96.3% on the AUC Distracted Driver Dataset and 66.9% on the Drive and Act Dataset. The class activation maps (CAMs) of our proposed method are subjectively more reasonable, which enhances the reliability and explainability of the model.
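The fusion step above can be sketched as a simple average of the two class-probability vectors. Averaging softmax outputs is our assumption about the fusion rule; the paper may use a different combination.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def fused_prediction(logits_raw, logits_pos):
    """Fuse the two forward passes (raw input and positive sample) by
    averaging their class probabilities. Simple averaging is assumed here."""
    return (softmax(logits_raw) + softmax(logits_pos)) / 2.0
```

Since both passes use the same trained network, this fusion doubles inference cost; the single-pass variant reported above avoids that overhead.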
INDEX TERMS: Action recognition, Advanced driver assistance, Contrastive learning, Multi-task learning, Intelligent vehicles

I. INTRODUCTION
Nowadays, distracted driving has become a serious threat to society. According to the report issued by the National Highway Traffic Safety Administration (NHTSA) in the United States, distracted driving led to 3,142 deaths in 2019, or 8.7 percent of all traffic fatalities that year [1], and many of these crashes involved texting or talking on mobile phones. Given this situation, traffic accidents could be reduced if effective distracted driving detectors were developed. Such detectors can be used...