With the growing influence of online public opinion, mining opinions and analyzing trends in the massive data produced by online media is important for understanding user sentiment, managing brand reputation, analyzing public opinion, and optimizing marketing strategies. Combining data from multiple perceptual modalities yields more comprehensive and accurate sentiment analysis than any single modality alone. However, multimodal sentiment analysis faces challenges such as data fusion, modality imbalance, and inter-modal correlation. To address these challenges, this paper introduces an attention mechanism into multimodal sentiment analysis: it constructs text, image, and audio feature extractors, uses a custom cross-modal attention layer to compute attention weights between modalities, and finally fuses the attention-weighted features for sentiment classification. Through the cross-modal attention mechanism, the model automatically learns correlations between modalities, dynamically adjusts modality weights, and selectively fuses features across modalities, thereby improving the accuracy and expressiveness of sentiment analysis.
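The following is a minimal PyTorch sketch of the cross-modal attention fusion just described, in which each modality queries the other two and the attention-weighted summaries are concatenated for classification. The feature dimension, the use of `nn.MultiheadAttention`, the mean-pooling step, and the linear classifier head are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Fuses text, image, and audio features via cross-modal attention (sketch)."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 3):
        super().__init__()
        # One attention block per target modality: each modality queries the
        # other two, so its weights adapt to inter-modal correlation.
        self.attn = nn.ModuleDict({
            m: nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for m in ("text", "image", "audio")
        })
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, text, image, audio):
        # Each input: (batch, seq_len, dim), produced by that modality's extractor.
        feats = {"text": text, "image": image, "audio": audio}
        fused = []
        for name, query in feats.items():
            # Keys/values are the concatenated *other* modalities, so the
            # attention weights dynamically express cross-modal relevance.
            others = torch.cat([f for n, f in feats.items() if n != name], dim=1)
            attended, _ = self.attn[name](query, others, others)
            fused.append(attended.mean(dim=1))  # pool over the sequence axis
        # Concatenate the attention-weighted modality summaries and classify.
        return self.classifier(torch.cat(fused, dim=-1))


# Usage with stand-in features (batch of 8, sequence length 10, dim 256):
model = CrossModalAttentionFusion()
t, i, a = (torch.randn(8, 10, 256) for _ in range(3))
logits = model(t, i, a)  # (8, 3) sentiment logits
```

Letting each modality attend over the others (rather than learning fixed fusion weights) is one way to realize the selective, dynamically weighted fusion the paragraph describes: a modality that is uninformative for a given sample simply receives low attention weight.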