Reliable systems for automatic estimation of the driver's gaze are crucial for reducing the number of traffic fatalities and for many emerging research areas aimed at developing intelligent vehicle-passenger systems. Gaze estimation is a challenging task, especially in environments with varying illumination and reflection properties. Furthermore, there is wide diversity with respect to the appearance of drivers' faces, both in terms of occlusions (e.g., vision aids) and cultural/ethnic backgrounds. For this reason, analysing the face along with contextual information, for example, the vehicle cabin environment, adds another, less subjective signal towards the design of robust systems for passenger gaze estimation. In this paper, we present an integrated approach to jointly model different features for this task. In particular, to improve the fusion of the visually captured environment with the driver's face, we have developed a contextual attention mechanism, X-AWARE, attached directly to the output convolutional layers of InceptionResNetV2 networks. In order to showcase the effectiveness of our approach, we use the Driver Gaze in the Wild dataset, recently released as part of the Eighth Emotion Recognition in the Wild (EmotiW) challenge. Our best model outperforms the baseline by an absolute 15.03% in accuracy on the validation set, and improves the previously best reported result by an absolute 8.72% on the test set.
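The abstract describes X-AWARE only at a high level, but the general idea of a contextual attention mechanism, where attention weights derived from the environment stream modulate the face stream's convolutional output, can be sketched as follows. This is a hypothetical illustration in NumPy, not the paper's actual implementation; all function and variable names (`contextual_attention_fuse`, `w_ctx`, etc.) are assumptions, and the real model operates on learned InceptionResNetV2 feature maps inside a trained network.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a flat vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual_attention_fuse(face_feat, ctx_feat, w_ctx):
    """Hypothetical contextual attention fusion.

    face_feat, ctx_feat: (H, W, C) convolutional feature maps
        (e.g., the output layers of two InceptionResNetV2 streams).
    w_ctx: (C,) illustrative learned projection for scoring context features.
    """
    # Score each spatial location of the context (cabin) feature map.
    scores = ctx_feat @ w_ctx                                   # (H, W)
    # Spatial softmax turns the scores into an attention map summing to 1.
    attn = softmax(scores.reshape(-1)).reshape(scores.shape)    # (H, W)
    # Re-weight the face features with the context-derived attention map.
    attended = face_feat * attn[..., None]                      # (H, W, C)
    # Pool both streams and concatenate for a downstream classification head.
    fused = np.concatenate([attended.mean(axis=(0, 1)),
                            ctx_feat.mean(axis=(0, 1))])        # (2C,)
    return fused, attn

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
fused, attn = contextual_attention_fuse(rng.normal(size=(H, W, C)),
                                        rng.normal(size=(H, W, C)),
                                        rng.normal(size=C))
```

The design choice being illustrated is that the attention weights come from the context stream rather than the face stream itself, so the cabin environment decides which facial regions are emphasised before fusion.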
CCS CONCEPTS
• Computing methodologies → Computer vision; Scene understanding; • Human-centered computing → Human-computer interaction (HCI).