A deep fusion model is proposed for a facial expression-based human-computer interaction system. First, image preprocessing, i.e., the extraction of the facial region from the input image, is performed. Discriminative and distinctive deep learning features are then extracted from the detected facial regions. To prevent overfitting, these in-depth facial features are extracted and fed to the proposed convolutional neural network (CNN) models, and several CNN models are trained. Finally, the outputs of the individual CNN models are fused to obtain the final decision over the seven basic facial expression classes, i.e., fear, disgust, anger, surprise, sadness, happiness, and neutral. For experimental evaluation, three benchmark datasets, i.e., SFEW, CK+, and KDEF, are used. The performance of the proposed system is compared with state-of-the-art methods on each dataset. Extensive performance analysis reveals that the proposed system outperforms the competing methods across various performance metrics. The proposed deep fusion model is then applied to control a music player using the recognized emotions of the users.
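
To illustrate the decision-level fusion step described above, the following is a minimal sketch in PyTorch. The network architecture, input size, number of models, and the averaging fusion rule are illustrative assumptions, not the exact configuration of the proposed system.

```python
# Minimal sketch of decision-level fusion over several CNNs (illustrative only;
# layer sizes, number of models, and the averaging fusion rule are assumptions).
import torch
import torch.nn as nn

EXPRESSIONS = ["fear", "disgust", "anger", "surprise", "sadness", "happiness", "neutral"]

class SmallExpressionCNN(nn.Module):
    """A small CNN over a cropped 48x48 grayscale facial region (hypothetical size)."""
    def __init__(self, num_classes: int = len(EXPRESSIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 12 * 12, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))

def fuse_predictions(models, face_batch: torch.Tensor) -> torch.Tensor:
    """Average the softmax outputs of all models (late, decision-level fusion)."""
    with torch.no_grad():
        probs = [torch.softmax(m(face_batch), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)  # shape: (batch, 7)

if __name__ == "__main__":
    models = [SmallExpressionCNN().eval() for _ in range(3)]  # e.g., three trained CNNs
    faces = torch.randn(4, 1, 48, 48)                         # stand-in for preprocessed face crops
    fused = fuse_predictions(models, faces)
    for i, label_idx in enumerate(fused.argmax(dim=1)):
        print(f"sample {i}: predicted expression = {EXPRESSIONS[int(label_idx)]}")
```

In practice, the fused class probabilities would be produced by the trained CNN models and the winning class (e.g., happiness or sadness) would then be passed to the music-player control logic.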