In automatic facial expression recognition (AFER) systems, modelling spatio‐temporal feature information, combining it, and using it effectively remain challenging. State‐of‐the‐art studies have examined the integration of multiple features to improve the recognition rate of AFER systems. However, the feature variations between expressive and neutral face images have not been fully exploited to identify the expression class. This work presents an approach to AFER that models appearance variations in both expressive and neutral face images. Its main contributions are (i) a novel hybrid feature space that integrates the discriminative feature distributions derived from expressive and neutral face images, and (ii) the preservation of the highly discriminative latent feature distribution using autoencoders. Local binary pattern (LBP) and histogram of oriented gradients (HOG) descriptors are used to derive discriminative texture and shape information, respectively. A component‐based approach is adopted, in which features are derived from salient facial regions rather than from the whole face. A three‐stage stacked deep convolutional autoencoder (SDCA) performs dimensionality reduction, and a multi‐class support vector machine (MSVM) performs classification. Empirical results on widely recognized benchmark datasets substantiate the efficacy of the proposed model and establish its superior accuracy in AFER tasks.
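As a rough illustration of the component‐based feature extraction described above, the following Python sketch computes a texture and a shape descriptor for each salient region of an expressive and a neutral image and joins them into one hybrid vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the basic 3×3 LBP variant, the single‐cell orientation histogram used as a simplified stand‐in for HOG, the region boxes, the function names, and the use of plain concatenation to integrate the expressive and neutral descriptors are all assumptions, since the abstract does not specify these details. The SDCA and MSVM stages are omitted, and the input images are synthetic placeholders.

```python
import math

# 8 neighbours of a pixel, visited clockwise from the top-left corner.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_histogram(img):
    """Normalised 256-bin histogram of basic 3x3 LBP codes (texture cue)."""
    hist = [0.0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            code = 0
            for bit, (dr, dc) in enumerate(NEIGHBOURS):
                # Set the bit when the neighbour is at least as bright as the centre.
                if img[r + dr][c + dc] >= img[r][c]:
                    code |= 1 << bit
            hist[code] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def orientation_histogram(img, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    Simplified, single-cell stand-in for HOG (shape cue); an assumption,
    not the full block-normalised HOG descriptor.
    """
    hist = [0.0] * bins
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            gx = img[r][c + 1] - img[r][c - 1]   # central differences
            gy = img[r + 1][c] - img[r - 1][c]
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle / (180.0 / bins)), bins - 1)
            hist[index] += math.hypot(gx, gy)
    norm = math.sqrt(sum(h * h for h in hist)) or 1.0
    return [h / norm for h in hist]

def crop(img, box):
    """Cut a (top, left, height, width) box out of a 2-D list image."""
    top, left, height, width = box
    return [row[left:left + width] for row in img[top:top + height]]

def region_descriptor(img, boxes):
    """Concatenate LBP and orientation histograms over all salient regions."""
    feats = []
    for box in boxes:
        patch = crop(img, box)
        feats += lbp_histogram(patch) + orientation_histogram(patch)
    return feats

def hybrid_feature(expressive, neutral, boxes):
    """Assumed integration: concatenate expressive and neutral descriptors."""
    return region_descriptor(expressive, boxes) + region_descriptor(neutral, boxes)

# Synthetic 16x16 stand-ins for a neutral and an expressive face (not real data).
neutral = [[c for c in range(16)] for r in range(16)]      # horizontal ramp
expressive = [[r for c in range(16)] for r in range(16)]   # vertical ramp

# Hypothetical salient regions given as (top, left, height, width).
boxes = [(0, 0, 8, 16), (8, 0, 8, 16)]

feature = hybrid_feature(expressive, neutral, boxes)
print(len(feature))  # 2 images x 2 regions x (256 + 9) bins = 1060
```

In a full pipeline, a vector of this kind would be passed to the dimensionality‐reduction stage and then to the classifier; a library implementation of LBP and HOG would normally replace the hand‐written loops shown here.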