Information on building structural types (BSTs) is vital for seismic risk and vulnerability modeling. However, obtaining this information is not a trivial task: the conventional approach relies on labor-intensive and inefficient manual inspection of each building. Recently, a few methods have explored using remote sensing images together with building-related knowledge (BRK) to automate BST recognition. However, these methods suffer from limitations such as insufficient mining of multimodal information and difficulty in obtaining BRK, which hinder their adoption and practical use. To address these shortcomings, we propose a deep multimodal fusion model that combines satellite optical remote sensing images, aerial Synthetic Aperture Radar (SAR) images, and BRK (roof type, color, and group pattern) provided by domain experts to achieve accurate automatic reasoning of BSTs. Specifically, we first use a pseudo-siamese network to extract image features. Second, a knowledge graph (KG) is constructed from the BRK, and a Graph Attention Network (GAT) is used to extract semantic features from the KG. Third, we propose a novel multi-stage gated fusion mechanism to fuse the image and semantic features. Our method achieves a best overall accuracy (OA) of 90.35% and a kappa coefficient of 0.83 on the dataset collected in the study area, outperforming a series of existing methods. With our model, high-precision BST information can be obtained to support earthquake disaster prevention, mitigation, and emergency decision-making.
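To make the three components named above concrete, the following is a minimal, hypothetical PyTorch sketch of how such a pipeline could be wired together. All module names, dimensions, the summation of the two image branches, and the single gating step (a simplified stand-in for the paper's multi-stage gated fusion) are our assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BranchCNN(nn.Module):
    """Small CNN branch; the optical and SAR branches share this architecture
    but not their weights (i.e., a pseudo-siamese pair)."""
    def __init__(self, in_ch, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))


class SimpleGATLayer(nn.Module):
    """Single-head graph attention layer over a dense adjacency matrix
    (a plain-PyTorch simplification of a GAT layer)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h, adj):
        # h: (N, in_dim) node attributes; adj: (N, N) adjacency with self-loops
        Wh = self.W(h)                                            # (N, out_dim)
        n = Wh.size(0)
        pair = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                          Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pair).squeeze(-1))                # (N, N) scores
        e = e.masked_fill(adj == 0, float("-inf"))                # keep edges only
        alpha = F.softmax(e, dim=-1)                              # attention weights
        return F.elu(alpha @ Wh)                                  # (N, out_dim)


class GatedFusionBSTClassifier(nn.Module):
    """Fuses image features (optical + SAR branches) with KG semantic features
    through a learned sigmoid gate, then classifies the building structural type."""
    def __init__(self, n_classes, feat_dim=128, kg_in_dim=16):
        super().__init__()
        self.opt_branch = BranchCNN(3, feat_dim)   # RGB optical branch
        self.sar_branch = BranchCNN(1, feat_dim)   # single-band SAR branch
        self.gat = SimpleGATLayer(kg_in_dim, feat_dim)
        self.gate = nn.Linear(2 * feat_dim, feat_dim)
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, opt_img, sar_img, kg_x, kg_adj, node_idx):
        # Image features from the two unshared-weight branches (summed here).
        img_feat = self.opt_branch(opt_img) + self.sar_branch(sar_img)   # (B, D)
        # Semantic features for every KG node, then pick each building's node.
        sem_feat = self.gat(kg_x, kg_adj)[node_idx]                      # (B, D)
        # Gated fusion: the gate decides how much to trust each modality.
        g = torch.sigmoid(self.gate(torch.cat([img_feat, sem_feat], dim=-1)))
        fused = g * img_feat + (1 - g) * sem_feat
        return self.head(fused)                                          # (B, C)


if __name__ == "__main__":
    model = GatedFusionBSTClassifier(n_classes=5)
    opt = torch.randn(4, 3, 64, 64)            # batch of optical patches
    sar = torch.randn(4, 1, 64, 64)            # co-registered SAR patches
    kg_x = torch.randn(10, 16)                 # 10 KG nodes, 16-dim BRK attributes
    kg_adj = torch.eye(10)                     # self-loops only, for illustration
    node_idx = torch.randint(0, 10, (4,))      # KG node assigned to each building
    logits = model(opt, sar, kg_x, kg_adj, node_idx)
    print(logits.shape)                        # torch.Size([4, 5])
```

The gate vector lets the model weight image evidence against BRK-derived semantics per feature dimension; the paper's multi-stage variant would apply such gating repeatedly at successive fusion stages rather than once, as sketched here.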