Recognizing human emotions simultaneously from multiple data modalities (e.g., face and speech) has drawn significant research interest, and numerous contributions have been made in the affective computing community. However, most methods pay little attention to facial alignment and keyframe selection for audio-visual input. Hence, this paper proposes a new audio-visual descriptor that characterizes the emotion using only a few frames. For this purpose, we propose a new self-similarity distance matrix (SSDM), which computes spatial and temporal distances through landmark points on the facial image. The audio signal is described through a set of composite features, including statistical features, spectral features, formant frequencies, and energies. A support vector machine (SVM) is employed to classify each modality, and the results are fused to predict the final emotion. The Surrey Audio-Visual Expressed Emotion (SAVEE) and Ryerson Multimedia Research Lab (RML) datasets are utilized for experimental validation, and the proposed method shows significant improvement over state-of-the-art methods.
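To make the SSDM idea concrete, the following is a minimal sketch (not the paper's exact formulation) of how such a matrix could be built: each frame is summarized by the pairwise spatial distances between its facial landmark points, and the temporal self-similarity is then the distance between these per-frame descriptors for every pair of frames. The (T, L, 2) landmark layout, the 68-point model, and the Euclidean metric are assumptions for illustration only.

```python
import numpy as np

def spatial_distance_vector(landmarks):
    """Pairwise Euclidean distances between all landmark points in one frame.

    landmarks: (L, 2) array of 2-D facial landmark coordinates.
    Returns the flattened L*(L-1)/2 upper-triangular distances.
    """
    diff = landmarks[:, None, :] - landmarks[None, :, :]   # (L, L, 2)
    dist = np.sqrt((diff ** 2).sum(-1))                    # (L, L)
    iu = np.triu_indices(len(landmarks), k=1)
    return dist[iu]

def self_similarity_distance_matrix(sequence):
    """Temporal self-similarity: distance between the spatial descriptors
    of every pair of frames.

    sequence: (T, L, 2) array of landmarks for T frames.
    Returns a symmetric (T, T) matrix.
    """
    descriptors = np.stack([spatial_distance_vector(f) for f in sequence])  # (T, D)
    diff = descriptors[:, None, :] - descriptors[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))                                     # (T, T)

if __name__ == "__main__":
    # Hypothetical example: 30 frames with 68 landmarks each (e.g., a dlib-style 68-point model).
    rng = np.random.default_rng(0)
    seq = rng.random((30, 68, 2))
    ssdm = self_similarity_distance_matrix(seq)
    print(ssdm.shape)  # (30, 30)
```

Under this sketch, frames whose rows in the SSDM differ most from the rest would be natural keyframe candidates, which is consistent with the paper's goal of describing the emotion through only a few frames; the actual keyframe-selection criterion used by the authors may differ.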