Sentiment analysis on social networks must contend with data that are both temporal and multi-modal, requiring models that can deeply mine and combine information across modalities. This study therefore constructs a sentiment analysis model based on multi-task learning. The model represents single-modal temporal features through a unified framework of convolutional networks, bidirectional gated recurrent units, and a multi-head self-attention mechanism, and adopts a cross-modal feature fusion strategy. In experiments, the model achieved an average precision of 0.83 and an F1-score of 0.83, outperforming multi-scale attention (0.69, 0.70), aspect-based sentiment analysis (0.78, 0.74), and long short-term memory (0.71, 0.78) models in both robustness and classification accuracy. In terms of parallel computing efficiency, the model reached a speedup ratio of 1.61, the highest among all compared models, highlighting its potential for time savings on large data volumes. These results demonstrate strong performance on social-network sentiment analysis and offer a novel perspective on complex sentiment classification problems.
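To make the described architecture concrete, the following is a minimal PyTorch sketch, not the paper's actual implementation: each modality is encoded by a 1-D convolution (local temporal features), a bidirectional GRU (long-range dependencies), and multi-head self-attention, and the per-modality representations are then fused by concatenation. All dimensions, the two assumed modalities (text and audio), and the pooling/fusion choices are illustrative assumptions.

```python
# Illustrative sketch (not the study's exact code): single-modality temporal
# encoding via Conv1d + bidirectional GRU + multi-head self-attention,
# followed by a simple concatenation-based cross-modal fusion.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, hidden=64, heads=4):
        super().__init__()
        # Local temporal patterns via a 1-D convolution
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        # Long-range temporal dependencies via a bidirectional GRU
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # Multi-head self-attention over the GRU output sequence
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

    def forward(self, x):          # x: (batch, seq_len, in_dim)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.gru(h)         # (batch, seq_len, 2*hidden)
        h, _ = self.attn(h, h, h)  # self-attention: query = key = value
        return h.mean(dim=1)       # temporal pooling -> (batch, 2*hidden)

class FusionClassifier(nn.Module):
    """Encode each modality separately, then fuse by concatenation."""
    def __init__(self, text_dim, audio_dim, n_classes=3, hidden=64):
        super().__init__()
        self.text_enc = ModalityEncoder(text_dim, hidden)
        self.audio_enc = ModalityEncoder(audio_dim, hidden)
        self.head = nn.Linear(4 * hidden, n_classes)

    def forward(self, text, audio):
        fused = torch.cat([self.text_enc(text), self.audio_enc(audio)], dim=-1)
        return self.head(fused)    # sentiment class logits

# Hypothetical feature sizes: 300-d text embeddings, 74-d acoustic features
model = FusionClassifier(text_dim=300, audio_dim=74)
logits = model(torch.randn(2, 20, 300), torch.randn(2, 20, 74))
print(logits.shape)  # torch.Size([2, 3])
```

A more faithful reproduction would add task-specific heads for multi-task learning and a learned cross-modal attention in place of plain concatenation, but the sketch captures the encoder pipeline the abstract names.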