In this paper, we present a comprehensive assessment of individuals’ mental engagement states during manual and autonomous driving in a driving simulator. Our study employed two sensor fusion approaches that combine multimodal signals at the data level and at the feature level, respectively. Participants were equipped with Electroencephalogram (EEG), Skin Potential Response (SPR), and Electrocardiogram (ECG) sensors to collect the corresponding physiological signals. To record and synchronize these signals in real time, we developed a custom Graphical User Interface (GUI). The recorded signals were pre-processed to remove noise and artifacts, segmented into 3 s windows, and labeled according to the drivers’ high or low mental engagement state during manual and autonomous driving. To implement the two fusion approaches, we used two architectures based on deep Convolutional Neural Networks (ConvNets), both built on the Braindecode Deep4 ConvNet model. The first architecture consisted of four convolutional layers followed by a dense layer and processed the synchronized experimental data as a single 2D array input. The second, novel architecture comprised three branches of the same ConvNet model, each with four convolutional layers, followed by a concatenation layer that integrates the branches and, finally, two dense layers; each branch received the experimental data from one sensor as a separate 2D array input. Both architectures were evaluated using Leave-One-Subject-Out (LOSO) cross-validation. For each architecture, we compared the results obtained using only EEG signals with those obtained after adding SPR and ECG signals. The second fusion approach, using all sensor signals, achieved the highest accuracy, reaching 82.0%.
This outcome indicates that our proposed architecture, particularly when integrating EEG, SPR, and ECG signals at the feature level, can effectively discern the mental engagement of drivers.
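As an illustration of the segmentation and evaluation protocol described above, the following minimal sketch splits a multichannel recording into non-overlapping 3 s windows and generates Leave-One-Subject-Out folds. It is not the implementation used in this study: the sampling rate (128 Hz), the channel count, the recording length, and the helper names `segment_windows` and `loso_splits` are assumptions made purely for illustration, and random numbers stand in for the recorded signals.

```python
import numpy as np


def segment_windows(signal, fs, window_s=3.0):
    """Split a (channels, samples) array into non-overlapping windows.

    Returns an array of shape (n_windows, channels, window_samples).
    Trailing samples that do not fill a whole window are dropped.
    """
    win = int(round(fs * window_s))
    n_windows = signal.shape[1] // win
    trimmed = signal[:, : n_windows * win]
    # (channels, n_windows, win) -> (n_windows, channels, win)
    return trimmed.reshape(signal.shape[0], n_windows, win).transpose(1, 0, 2)


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Toy stand-in data; every value below is an assumption, not from the study.
rng = np.random.default_rng(0)
fs = 128          # assumed sampling rate in Hz
n_channels = 4    # assumed number of channels
windows, subjects = [], []
for subject in range(3):
    recording = rng.standard_normal((n_channels, fs * 10))  # 10 s of signal
    w = segment_windows(recording, fs)
    windows.append(w)
    subjects.extend([subject] * len(w))

X = np.concatenate(windows)      # all windows, stacked across subjects
subjects = np.array(subjects)    # subject label for each window
folds = list(loso_splits(subjects))
```

Each fold holds out every window of one subject for testing and trains on the remaining subjects, so no subject contributes to both sets. In a data-level fusion setting the channels of all sensors would be stacked along the channel axis before windowing, whereas in a feature-level setting each sensor would be windowed separately and passed to its own branch.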