Traditional multi-modal sentiment analysis (MSA) methods usually treat the features of all modalities as equally important and ignore the fact that different modalities contribute differently to the final MSA result. To address this, an MSA method based on a hierarchical adaptive feature fusion network is proposed. First, RoBERTa, ResViT, and LibROSA are used to extract the features of the different modalities, and a hierarchical adaptive multi-modal fusion network is constructed. Then, the multi-modal feature extraction module and the cross-modal feature interaction module are combined to achieve interactive fusion of information across modalities. Finally, an adaptive gating mechanism is introduced to design a global multi-modal feature interaction module that learns the unique features of each modality. Experimental results on three public datasets show that the proposed method makes full use of multi-modal information, outperforms other advanced comparison methods, and improves the accuracy and robustness of sentiment analysis, suggesting that it is a promising approach for sentiment analysis.
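The idea of weighting modalities adaptively rather than equally can be illustrated with a minimal sketch of a gated fusion step. This is not the network proposed here: the function names `softmax` and `gated_fusion`, the single linear gating layer, the softmax normalization over modalities, and the random placeholder features standing in for text, visual, and audio embeddings are all assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, weights, bias):
    """Fuse M modality vectors (each of length d) into one vector of length d.

    features: list of M feature vectors, one per modality (hypothetical inputs).
    weights:  M rows of length M*d, a hypothetical linear gating layer.
    bias:     M floats.
    Returns the fused vector and the per-modality gate values.
    """
    # Concatenate all modality features so each gate sees every modality.
    concat = [x for vec in features for x in vec]
    # One gate score per modality from the linear layer.
    scores = [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weights, bias)
    ]
    # Normalize the scores so the gates sum to 1 across modalities.
    gates = softmax(scores)
    dim = len(features[0])
    # Weighted sum of the modality vectors, element by element.
    fused = [
        sum(g * vec[i] for g, vec in zip(gates, features))
        for i in range(dim)
    ]
    return fused, gates


# Placeholder features; in practice these would come from the modality encoders.
random.seed(0)
d = 4
text = [random.gauss(0, 1) for _ in range(d)]
visual = [random.gauss(0, 1) for _ in range(d)]
audio = [random.gauss(0, 1) for _ in range(d)]
weights = [[random.gauss(0, 0.1) for _ in range(3 * d)] for _ in range(3)]
bias = [0.0, 0.0, 0.0]

fused, gates = gated_fusion([text, visual, audio], weights, bias)
print("gates:", gates)
print("fused:", fused)
```

With all gating weights set to zero, the gates become uniform and the fusion reduces to a plain average, which corresponds to the equal-importance assumption that the abstract criticizes. In a trained model the gating parameters would be learned, so the weights can differ from sample to sample.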