Recognizing human affective states remains challenging because emotions are complex, involving experiential, behavioral, and physiological elements. Since a single modality cannot describe emotion comprehensively, recent studies have focused on fusion strategies that exploit the complementarity of multimodal signals. In this article, we study the feasibility of fusing facial expressions with physiological cues and its effect on emotion recognition accuracy. The contributions of this work are threefold: 1) We propose a new spatiotemporal network for facial expression recognition, using a 3D Xception architecture with 3D squeeze-and-excitation (the squeeze-and-excitation Xception network). 2) We perform multimodal fusion from a single input source: to the best of our knowledge, no existing multimodal emotion recognition system has attempted to identify emotional states from facial videos alone using both facial expression and physiological signal features. 3) We compare unimodal approaches, which use only facial expressions or only physiological data, with multimodal systems that fuse facial expressions with video-based physiological cues. In our experiments, the physiological inputs are the imaging photoplethysmography (iPPG) signal and heart rate variability features, both measured remotely from video using the iPPG method. The preliminary results show that the multimodal fusion model improves emotion recognition accuracy, and that merging facial expression features with the iPPG signal gives the best accuracy, 71.90%.
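To make the notion of heart rate variability features concrete, the following minimal Python sketch computes three common time-domain descriptors from a sequence of inter-beat intervals. It is an illustration under stated assumptions, not the feature set used in this work: the abstract does not specify which heart rate variability features are extracted, and the function name `hrv_features`, the choice of mean heart rate, SDNN, and RMSSD, and the assumption that inter-beat intervals (in milliseconds) have already been obtained from the iPPG signal by peak detection are all hypothetical.

```python
import math
import statistics


def hrv_features(ibi_ms):
    """Time-domain HRV descriptors from inter-beat intervals (milliseconds).

    Hypothetical helper for illustration; the intervals are assumed to come
    from peak detection on a remotely measured pulse signal.
    """
    if len(ibi_ms) < 2:
        raise ValueError("need at least two inter-beat intervals")

    # Differences between successive inter-beat intervals.
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]

    mean_ibi = statistics.fmean(ibi_ms)
    return {
        # Mean heart rate in beats per minute (60000 ms per minute).
        "mean_hr_bpm": 60000.0 / mean_ibi,
        # SDNN: sample standard deviation of the intervals.
        "sdnn_ms": statistics.stdev(ibi_ms),
        # RMSSD: root mean square of successive differences.
        "rmssd_ms": math.sqrt(statistics.fmean([d * d for d in diffs])),
    }


if __name__ == "__main__":
    example_ibi = [800, 810, 790, 805, 795]  # made-up values, not study data
    print(hrv_features(example_ibi))
```

In a fusion setting, a vector of such descriptors could be combined with facial expression features before classification; how that combination is done here is not stated in the abstract.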