For structural health monitoring, a complete dataset is important for further analysis such as modal identification and early risk warning. Unfortunately, missing data are common in current databases due to sensor failures, transmission system interruptions, and hardware malfunctions. Most existing studies either delete the records containing missing data or impute them with mean values, which can misrepresent changes in the characteristics of the structure. The present study therefore develops a heterogeneous structural response recovery method based on a multi-modal fusion auto-encoder, which simultaneously considers temporal correlations, spatial correlations, and correlations between heterogeneous structural responses. Moreover, a parallel optimization method is proposed to optimize the parameters of the deep fusion networks. A dataset covering about 3 months and containing two input attributes is collected from a bridge and used to train and test the proposed method and several benchmark methods. Three statistical scores, root mean square error (RMSE), mean absolute error (MAE), and mean relative error (MRE), are applied to evaluate the performance of the implemented models. Results show that the proposed method achieves the best imputation performance under different missing scenarios. Furthermore, the proposed method also achieves better performance when the missing rate is high. The results suggest that considering the correlations between heterogeneous structural responses is critical for missing data recovery.
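As an illustration of the three evaluation scores named above, the following is a minimal sketch of how RMSE, MAE, and MRE might be computed over the imputed entries only. It is not the study's implementation. The function name `imputation_scores`, the boolean `missing_mask` convention, and the small `eps` guard are assumptions made for this example, and MRE is taken here as the mean of absolute errors divided by the absolute true values, which is one common definition and may differ from the one used in the study.

```python
import numpy as np

def imputation_scores(y_true, y_pred, missing_mask, eps=1e-8):
    """Score imputed values against ground truth at the masked positions.

    missing_mask is True where a value was treated as missing and imputed
    (an assumed convention for this sketch).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = np.asarray(missing_mask, dtype=bool)

    # Only the positions that were imputed contribute to the scores.
    err = y_pred[mask] - y_true[mask]

    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # Assumed MRE definition: mean of |error| / |true value|;
    # eps avoids division by zero when a true value is exactly 0.
    mre = float(np.mean(np.abs(err) / (np.abs(y_true[mask]) + eps)))
    return {"RMSE": rmse, "MAE": mae, "MRE": mre}

# Toy example with two imputed positions (indices 1 and 2).
scores = imputation_scores(
    y_true=[1.0, 2.0, 4.0, 8.0],
    y_pred=[1.0, 3.0, 2.0, 8.0],
    missing_mask=[False, True, True, False],
)
```

Restricting the scores to the masked positions keeps observed values, which any method reproduces exactly, from diluting the comparison between imputation methods.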