Blue Horizontal Branch stars (BHBs) are ideal tracers for probing the global structure of the Milky Way (MW), and a larger BHB sample would help constrain the MW’s enclosed mass and kinematics more accurately. Large survey telescopes are producing ever more astronomical images and spectra, but traditional methods of identifying BHBs scale poorly to data volumes of this size. A fast and efficient identification method can therefore provide a larger sample for further analysis. To make full use of the available observations and to improve the identification accuracy of BHBs, we propose and implement a Transformer-based multimodal fusion model built on a Bi-level attention mechanism, called Bi-level Attention in the Transformer with Multimodality (BATMM). The model consists of a spectrum encoder, an image encoder, and a Transformer multimodal fusion module. The fusion module combines the two modalities, image and spectrum, through the proposed Bi-level attention mechanism, which comprises cross-attention and self-attention. The information from the two modalities is thus complementary, which improves the accuracy of BHB identification. Experimental results show that BATMM achieves an F1 score of 94.78%, which is 21.77% and 2.76% higher than the image-only and spectrum-only unimodal models, respectively. This demonstrates that higher BHB identification accuracy can be achieved by using data from multiple modalities together with an efficient data fusion strategy.
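The two attention levels described above can be illustrated with a minimal, dependency-free sketch. It is not the BATMM implementation: the function names (`attention`, `bi_level_fusion`), the token counts, the feature dimension, the order of the two levels, and the final mean pooling are all assumptions made for illustration, and the learned query, key, and value projections of a real Transformer are omitted. Random vectors stand in for the outputs of the spectrum and image encoders.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention (learned projections omitted for brevity)."""
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    weights = [softmax([s / scale for s in row]) for row in matmul(queries, keys_t)]
    return matmul(weights, values)

def bi_level_fusion(spec_tokens, img_tokens):
    """Hypothetical fusion: cross-attention between modalities, then joint self-attention."""
    # Level 1 (assumed order): each modality queries the other one.
    spec_ctx = attention(spec_tokens, img_tokens, img_tokens)
    img_ctx = attention(img_tokens, spec_tokens, spec_tokens)
    # Level 2: self-attention over the concatenated, cross-attended tokens.
    joint = spec_ctx + img_ctx
    fused = attention(joint, joint, joint)
    # Mean-pool the tokens into one feature vector for a downstream classifier.
    n_tokens = len(fused)
    return [sum(col) / n_tokens for col in zip(*fused)]

random.seed(0)
dim = 8  # assumed feature dimension shared by both encoders
# Random stand-ins for the spectrum and image encoder outputs.
spec_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
img_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

fused_vector = bi_level_fusion(spec_tokens, img_tokens)
print(len(fused_vector))  # prints 8
```

In this sketch, cross-attention lets each modality pull in the features of the other, and the subsequent self-attention mixes the combined token set before pooling. A classifier head on `fused_vector` would then separate BHBs from other stars.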