Because they cannot efficiently encode distinguishing edges, local appearance-based texture descriptors generally perform poorly in facial expression analysis. Existing techniques suffer from edge-related disturbances caused by noise in face photographs and rely on fixed feature sets that may fail to capture the subtleties of emotion across varied contexts. To address these difficulties in facial expression recognition and emotion categorisation, this study presents a framework that combines three information sets: a novel audio-visual descriptor, prosodic features, and the Local Directional Pattern (LDP). The principal motivation is that noise-induced distorted and weak edges in face images lead to inaccurate assessment of expression characteristics.

Unlike standard local descriptors, the LDP approach improves the robustness of facial feature extraction by identifying and encoding only the strongest edge responses. Robinson and Kirsch compass masks are used for edge detection, and the LDP formulation encodes each pixel with seven bits of information to reduce code repetition. The acoustic feature set comprises the Long-Term Average Spectrum (LTAS) of the speech signal, Mel-Frequency Cepstral Coefficients (MFCCs), and formants. The Fisher Criterion is used for dimensionality reduction and to guide feature selection. Emotion prediction is achieved by classifying two distinct conditions with Support Vector Machine (SVM) and Decision Tree (DT) algorithms and fusing the resulting outputs.

The study also introduces a novel audio-visual descriptor that prioritises key frame selection and face localisation for audio-visual input. The proposed Self-Similarity Distance Matrix (SSDM), computed over facial landmark points, captures both temporal and spatial correlations and provides a concise representation of expression. The acoustic signal is characterised by formant frequency ranges, energy, statistical properties, and spectral features. The emotion recognition system attains an accuracy of 98%. Validation studies on the SAVEE and RML datasets show substantial improvements over state-of-the-art techniques, underscoring the usefulness of the proposed model in recognising and categorising emotions and facial movements across varied contexts. The framework is implemented in Python.
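As a concrete illustration of the edge-response encoding described above, the following minimal sketch computes a directional-pattern code from the eight Kirsch compass masks; the paper's seven-bit formulation and the Robinson masks would plug into the same structure, and all function names here are illustrative rather than the authors' implementation.

```python
import numpy as np
from scipy.ndimage import convolve

# The eight Kirsch compass masks (E, NE, N, NW, W, SW, S, SE).
KIRSCH = [
    np.array([[-3, -3, 5], [-3, 0, 5], [-3, -3, 5]]),
    np.array([[-3, 5, 5], [-3, 0, 5], [-3, -3, -3]]),
    np.array([[5, 5, 5], [-3, 0, -3], [-3, -3, -3]]),
    np.array([[5, 5, -3], [5, 0, -3], [-3, -3, -3]]),
    np.array([[5, -3, -3], [5, 0, -3], [5, -3, -3]]),
    np.array([[-3, -3, -3], [5, 0, -3], [5, 5, -3]]),
    np.array([[-3, -3, -3], [-3, 0, -3], [5, 5, 5]]),
    np.array([[-3, -3, -3], [-3, 0, 5], [-3, 5, 5]]),
]

def ldp_code(gray, k=3):
    """Set a bit for each of a pixel's k strongest directional edge responses."""
    # Absolute response of every pixel to each compass mask.
    responses = np.stack([np.abs(convolve(gray.astype(float), m)) for m in KIRSCH])
    # Indices of the k largest responses per pixel, shape (k, H, W).
    topk = np.argsort(responses, axis=0)[-k:]
    code = np.zeros(gray.shape, dtype=np.uint8)
    for direction in range(8):
        # OR in this direction's bit for pixels whose top-k set contains it.
        code |= (topk == direction).any(axis=0).astype(np.uint8) << direction
    return code
```

Encoding only the k strongest responses, rather than thresholding every direction, is what gives the descriptor its resistance to weak, noise-distorted edges.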
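The acoustic feature set described above (LTAS, MFCCs, and formants) can be assembled roughly as follows. This sketch assumes librosa, uses illustrative parameter values rather than the paper's settings, and takes the upper-half-plane roots of an LPC polynomial as a common rough formant estimate.

```python
import numpy as np
import librosa

def acoustic_features(path, n_mfcc=13, lpc_order=12):
    """Return one feature vector per utterance: MFCC means, LTAS, formants."""
    y, sr = librosa.load(path, sr=None)
    # MFCCs averaged over frames to give a fixed-length summary.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    # Long-Term Average Spectrum: magnitude spectrogram averaged over time.
    ltas = np.abs(librosa.stft(y)).mean(axis=1)
    # Rough formant estimates from the roots of an LPC error polynomial.
    a = librosa.lpc(y, order=lpc_order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    formants = np.sort(np.angle(roots) * sr / (2.0 * np.pi))
    return np.concatenate([mfcc, ltas, formants[:3]])  # first three formants
```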
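Dimensionality reduction with the Fisher Criterion amounts to scoring each feature by its between-class separation relative to its within-class spread and keeping the top-scoring features. A minimal two-class sketch follows (the multi-class case sums the same ratio over class pairs); names are illustrative.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher score for a binary labelling y in {0, 1}."""
    c0, c1 = X[y == 0], X[y == 1]
    between = (c0.mean(axis=0) - c1.mean(axis=0)) ** 2  # class-mean separation
    within = c0.var(axis=0) + c1.var(axis=0) + 1e-12    # pooled class variance
    return between / within

def select_top(X, y, n_keep):
    """Keep the n_keep features with the highest Fisher scores."""
    idx = np.argsort(fisher_scores(X, y))[::-1][:n_keep]
    return X[:, idx], idx
```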
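One plausible reading of the SVM-plus-DT step is decision-level fusion by soft voting, sketched below with scikit-learn; the exact fusion rule is not spelled out in this section, so the voting scheme here is an assumption.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

# Average the class-probability estimates of the two classifiers.
fused = VotingClassifier(
    estimators=[('svm', SVC(kernel='rbf', probability=True)),
                ('dt', DecisionTreeClassifier(max_depth=10))],
    voting='soft',
)
# Usage: fused.fit(X_train, y_train); y_pred = fused.predict(X_test)
```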
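Finally, the SSDM reduces to pairwise distances between per-frame facial landmark configurations, which is what lets a single matrix capture both spatial and temporal correlations. A minimal sketch, assuming landmark extraction happens upstream:

```python
import numpy as np
from scipy.spatial.distance import cdist

def ssdm(landmarks):
    """landmarks: (n_frames, n_points, 2) facial keypoints for one clip."""
    flat = landmarks.reshape(len(landmarks), -1)  # one row per frame
    # Entry (i, j) is the distance between the face shapes at frames i and j.
    return cdist(flat, flat, metric='euclidean')
```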