Human conversation analysis is challenging because the meaning can be expressed through words, intonation, or even body language and facial expression. We introduce a hierarchical encoderdecoder structure with attention mechanism for conversation analysis. The hierarchical encoder learns word-level features from video, audio, and text data that are then formulated into conversation-level features. The corresponding hierarchical decoder is able to predict different attributes at given time instances. To integrate multiple sensory inputs, we introduce a novel fusion strategy with modality attention. We evaluated our system on published emotion recognition, sentiment analysis, and speaker trait analysis datasets. Our system outperformed previous state-ofthe-art approaches in both classification and regressions tasks on three datasets. We also outperformed previous approaches in generalization tests on two commonly used datasets. We achieved comparable performance in predicting co-existing labels using the proposed model
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.