Multimodal abstractive summarisation (MAS) aims to generate a textual summary from a collection of multimodal data, such as video-text pairs. Despite the success of recent work, existing methods lack a thorough analysis of consistency across modalities. Moreover, previous work relies on fusion methods to extract multimodal semantics, neglecting constraints on the complementary semantics of each modality. To address these issues, a multilayer cross-fusion model with a reconstructor is proposed for the MAS task. The model thoroughly conducts cross-fusion between modalities via layers of cross-modal transformer blocks, yielding cross-modal fusion representations that are consistent across modalities. A reconstructor is then employed to reproduce the source modalities from the cross-modal fusion representations; this reconstruction process constrains the fusion representations to retain the complementary semantics of each modality. Comprehensive comparison and ablation experiments are conducted on the open-domain multimodal dataset How2. The results empirically verify the effectiveness of the multilayer cross-fusion with the reconstructor structure in the proposed model.
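
To make the described architecture concrete, below is a minimal PyTorch sketch of one plausible reading of the multilayer cross-fusion with a reconstructor: stacked blocks in which each modality attends to the other, followed by per-modality reconstructors whose loss constrains the fused representations. All module names, dimensions, the use of standard nn.MultiheadAttention, and the MSE reconstruction loss are illustrative assumptions; the paper's exact block design and objective may differ.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One cross-fusion layer: each modality attends to the other (assumed design)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.txt_attends_vid = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vid_attends_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, txt, vid):
        # Text queries attend over video keys/values and vice versa,
        # so each stream absorbs semantics from the other modality.
        t, _ = self.txt_attends_vid(txt, vid, vid)
        v, _ = self.vid_attends_txt(vid, txt, txt)
        return self.norm_t(txt + t), self.norm_v(vid + v)

class CrossFusionWithReconstructor(nn.Module):
    """Stack of cross-modal blocks plus per-modality reconstructors (hypothetical)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            CrossModalBlock(d_model, n_heads) for _ in range(n_layers)
        )
        # Reconstructors map fused representations back toward the original
        # modality features, constraining the fusion to retain each
        # modality's complementary semantics.
        self.txt_reconstructor = nn.Linear(d_model, d_model)
        self.vid_reconstructor = nn.Linear(d_model, d_model)

    def forward(self, txt, vid):
        fused_t, fused_v = txt, vid
        for layer in self.layers:
            fused_t, fused_v = layer(fused_t, fused_v)
        # Reconstruction loss over both modalities (MSE as a stand-in).
        rec_loss = (
            nn.functional.mse_loss(self.txt_reconstructor(fused_t), txt)
            + nn.functional.mse_loss(self.vid_reconstructor(fused_v), vid)
        )
        return fused_t, fused_v, rec_loss

# Toy usage: batch of 2, 10 text tokens, 16 video frames, d_model = 512.
model = CrossFusionWithReconstructor()
txt = torch.randn(2, 10, 512)
vid = torch.randn(2, 16, 512)
fused_t, fused_v, rec_loss = model(txt, vid)
print(fused_t.shape, fused_v.shape, rec_loss.item())

In training, rec_loss would be added to the summary-generation loss so that the fused representations feeding the summariser stay faithful to both source modalities.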