Modular multilevel converters (MMCs) have been widely applied in high-voltage direct current (HVDC) transmission projects, and their secure operation is a fundamental concern. Faults in the submodule (SM) switching devices are the most common problem in HVDC MMCs; however, traditional fault diagnosis methods mostly rely on a single modality, which limits their feature extraction ability. To address this limitation, a novel multimodal attention fusion (MAF) model is proposed in this paper. First, the three-phase internal circulating currents of the MMC are converted into two-dimensional (2D) time-frequency images by the synchrosqueezing transform (SST). To automatically focus on the discriminative regions most related to fault features, a visual attention model (VAM) and a time series attention model (TSAM) are proposed to learn the features of the 2D time-frequency images and the internal circulating current time series data, respectively. Then, a multimodal attention model (MAM) based on intermediate fusion is proposed, which exploits the internal correlation between visual features and time series features for joint fault feature extraction. Finally, a late fusion scheme combines the fault prediction results of the three attention models for fault diagnosis. The proposed MAF model achieves fault diagnosis accuracies of 98.4% and 97.3% on the 31-level and 61-level MMC datasets, respectively. Experiments demonstrate that the designed model improves feature representation power and achieves better fault diagnosis performance than state-of-the-art baseline methods.

INDEX TERMS Modular multilevel converter, fault diagnosis, attention model, multimodal fusion.

I. INTRODUCTION