A multimodal emotion recognition system using speech and facial images is proposed. For this purpose, a video database is developed containing three affective states, namely anger, sadness and happiness. The audio and the snapshots of facial expressions acquired from the videos constitute the bimodal input for recognizing emotions. The spoken sentences in the database include both text dependent and text independent sentences in the Malayalam language. The audio features are obtained by short-time processing of speech: energy, zero crossing count, pitch and Mel Frequency Cepstral Coefficients. For facial expressions, landmark features of the face (eyebrows, eyes and mouth) obtained using the Viola-Jones algorithm are used. The supervised learning methods K-Nearest Neighbor and Artificial Neural Network are used for emotion analysis.
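A minimal sketch of the kind of short-time audio feature extraction described above is given below. It assumes the librosa library; the sampling rate, frame length, hop length and number of coefficients are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
import librosa

def extract_audio_features(wav_path, frame_length=400, hop_length=160):
    # Load speech at 16 kHz (illustrative sampling rate)
    y, sr = librosa.load(wav_path, sr=16000)

    # Short-time energy and zero crossing count per frame
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    energy = np.sum(frames.astype(float) ** 2, axis=0)
    zcc = librosa.feature.zero_crossing_rate(
        y, frame_length=frame_length, hop_length=hop_length)[0] * frame_length

    # Pitch (fundamental frequency) estimate per frame
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                     frame_length=frame_length, hop_length=hop_length)

    # Mel Frequency Cepstral Coefficients (13 per frame)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame_length, hop_length=hop_length)

    # Summarize frame-level features into one fixed-length utterance vector
    parts = [energy, zcc, f0] + list(mfcc)
    return np.array([stat(p) for p in parts for stat in (np.mean, np.std)])
```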
The system performance is evaluated for three cases: audio features alone, facial features alone, and both feature sets taken together. Further, the effect of text dependent versus text independent audio is also analyzed. The results show that text independent videos utilizing both modalities, classified with K-Nearest Neighbor, are the most effective in recognizing emotions from the database considered, with a highest accuracy of 82.78%.
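The facial half of the pipeline and the fusion step can be sketched along the following lines, assuming OpenCV's Haar cascade (Viola-Jones) detectors and scikit-learn's KNN classifier. The cascade files, the geometric features derived from the detected regions, and the value of k are assumptions for illustration rather than the paper's exact configuration.

```python
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def facial_feature_vector(image_path):
    """Simple geometric features for the largest Viola-Jones face detection."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])   # largest face
    roi = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5)
    # Illustrative features: face aspect ratio plus normalized eye boxes
    feat = [w / h, float(len(eyes))]
    for ex, ey, ew, eh in list(eyes)[:2]:
        feat += [ex / w, ey / h, ew / w, eh / h]
    feat += [0.0] * (10 - len(feat))                      # pad to fixed length
    return np.array(feat)

# Late fusion by concatenating audio and facial vectors, then KNN (k assumed 5).
# X_audio, X_face and the label vector y are hypothetical training arrays.
# X = np.hstack([X_audio, X_face])
# knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
# predicted_emotion = knn.predict(X[:1])
```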