In damage-level classification, deep learning models tend to focus on regions unrelated to the classification target because of the complexities inherent in real data, such as the diversity of damage types (e.g., cracks, efflorescence, and corrosion), which degrades classification performance. Solving this problem requires handling both the complexity and the uncertainty of the data. This study proposes a multimodal deep learning model that can focus on damaged regions by using text data related to the damage in an image, such as materials and components. Furthermore, by adjusting the influence of the attention maps on damage-level classification according to the confidence computed when estimating these maps, the proposed method achieves accurate damage-level classification. Our contribution is the development of a model with an end-to-end multimodal attention mechanism that simultaneously considers text data, image data, and the confidence of the attention map. Finally, experiments on real images validate the effectiveness of the proposed method.
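To make the core idea concrete, the following is a minimal sketch of a confidence-weighted multimodal attention module in PyTorch. The module name, layer choices, and the confidence-gating scheme are illustrative assumptions rather than the authors' exact architecture: a text embedding conditions the image features, a spatial attention map is estimated from the conditioned features, and a learned confidence score gates how strongly that map affects the features passed on to damage-level classification.

```python
import torch
import torch.nn as nn


class ConfidenceWeightedMultimodalAttention(nn.Module):
    """Hypothetical sketch: fuse image features with a text embedding to
    estimate a spatial attention map, then gate the map's influence by a
    learned confidence score (not the authors' exact model)."""

    def __init__(self, img_channels: int, text_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, img_channels)
        self.attn_conv = nn.Conv2d(img_channels, 1, kernel_size=1)
        self.conf_head = nn.Sequential(nn.Linear(img_channels, 1), nn.Sigmoid())

    def forward(self, img_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W) CNN features; text_emb: (B, D) text embedding
        t = self.text_proj(text_emb)                     # (B, C)
        fused = img_feat * t[:, :, None, None]           # text-conditioned features
        attn = torch.sigmoid(self.attn_conv(fused))      # (B, 1, H, W) attention map
        pooled = fused.mean(dim=(2, 3))                  # (B, C) global context
        conf = self.conf_head(pooled)[:, :, None, None]  # (B, 1, 1, 1) confidence
        # Low confidence reduces the attention map's effect on the output
        return conf * (attn * img_feat) + (1 - conf) * img_feat


# Example usage with hypothetical feature shapes
attn_block = ConfidenceWeightedMultimodalAttention(img_channels=256, text_dim=128)
img_feat = torch.randn(4, 256, 14, 14)  # backbone feature map
text_emb = torch.randn(4, 128)          # embedding of, e.g., "steel girder, corrosion"
out = attn_block(img_feat, text_emb)    # (4, 256, 14, 14)
```

The confidence gate is one plausible way to realize the paper's stated idea of adjusting the attention map's effect based on the confidence computed when estimating it; the actual formulation should be taken from the paper itself.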