Sentiment analysis (SA) aims to computationally understand the attitudes and views of opinion holders. Over the past decade, SA has achieved significant breakthroughs and found extensive applications, such as public opinion analysis and intelligent voice services. With the rapid development of deep learning, SA based on various modalities has become a research hotspot. However, most existing studies analyze each modality in isolation, and a systematic organization of SA methods across modalities is still lacking. Meanwhile, few surveys have yet covered multimodal SA (MSA). In this article, we first take modality as the organizing thread and design a novel framework of SA tasks to give researchers a comprehensive understanding of relevant advances in SA. We then detail the general workflows and recent advances of single-modal SA, discuss the similarities and differences among single-modal SA approaches in data processing and modeling to inform MSA, and summarize commonly used datasets to guide researchers toward suitable data and methods for different task types. Next, we propose a new taxonomy to fill the research gap in MSA, dividing MSA methods into multimodal representation learning and multimodal data fusion. We describe in detail the similarities and differences between these two lines of work and their latest advances, such as dynamic interaction between modalities, and further elaborate on multimodal fusion technologies. Moreover, we review advanced studies on multimodal alignment, chatbots, and Chat Generative Pre-trained Transformer (ChatGPT) in SA. Finally, we discuss the open research challenges of MSA and suggest four promising directions for future work, such as cross-modal contrastive learning and multimodal pretraining models.
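To make the taxonomy's fusion branch concrete, the following minimal PyTorch sketch (not taken from the survey; the module names, feature dimensions, and three-class sentiment setup are illustrative assumptions) contrasts feature-level (early) fusion, where modality features are concatenated before a shared classifier, with decision-level (late) fusion, where per-modality predictions are combined.

```python
import torch
import torch.nn as nn

class EarlyFusionSA(nn.Module):
    """Feature-level (early) fusion: concatenate modality features,
    then classify with a shared head. Dimensions are illustrative."""
    def __init__(self, text_dim=768, audio_dim=128, video_dim=256, num_classes=3):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + audio_dim + video_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_feat, audio_feat, video_feat):
        # Concatenate along the feature dimension before classification.
        fused = torch.cat([text_feat, audio_feat, video_feat], dim=-1)
        return self.classifier(fused)

class LateFusionSA(nn.Module):
    """Decision-level (late) fusion: classify each modality separately,
    then average the per-modality logits."""
    def __init__(self, dims=(768, 128, 256), num_classes=3):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d, num_classes) for d in dims)

    def forward(self, *feats):
        logits = [head(f) for head, f in zip(self.heads, feats)]
        return torch.stack(logits).mean(dim=0)

# Toy usage: random tensors stand in for real text/audio/video encoder outputs.
text, audio, video = torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 256)
print(EarlyFusionSA()(text, audio, video).shape)  # torch.Size([4, 3])
print(LateFusionSA()(text, audio, video).shape)   # torch.Size([4, 3])
```

The dynamic cross-modal interaction methods discussed in the survey replace these static combination steps with learned mechanisms such as cross-modal attention; the sketch above only fixes the baseline early/late distinction that the taxonomy builds on.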