As a branch of sentiment analysis, emotion recognition in conversation (ERC) aims to identify the underlying emotion of a speaker by analyzing the sentiment expressed in each utterance. In the multimodal setting, the data for each utterance include its text together with the corresponding acoustic and visual signals. By integrating features from these modalities, the emotion of an utterance can be predicted more accurately. ERC research faces challenges in context construction, speaker dependency design, and the fusion of heterogeneous multimodal features. This review therefore begins by defining the ERC task, outlining the development of research in the field, and introducing the datasets used in detail. We then analyze how existing work models conversational context, captures speaker dependency, and fuses multimodal information, with a view to evaluating these approaches. Finally, we discuss the research and application challenges of ERC and the opportunities they present.