Action detection of construction equipment is critical for tracking project performance, facilitating construction automation, and improving the efficiency of construction site monitoring. In particular, auditory signals can complement computer vision‐based action detection across various types of construction equipment. This study therefore develops a visual–auditory learning network model for the action detection of construction equipment based on two modalities (i.e., vision and audition). To this end, visual and auditory features are extracted by a multi‐modal feature extractor. In addition, a multi‐head attention and detection module is designed to conduct the localization and classification tasks in separate heads, with a different attention mechanism applied to each task. Specifically, a content‐based attention mechanism is adopted for spatial attention in the localization head, and a dot‐product attention mechanism is adopted for channel attention in the classification head. The evaluation results show that the precision and recall of the proposed model reach 86.92% and 84.00%, respectively, with the adoption of the multi‐head attention and detection module, which is shown to improve overall detection performance by exploiting different correlations between visual and auditory features for localization and classification, respectively.
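
The two attention mechanisms named above can be illustrated with a minimal, standard-library Python sketch. It is not the network proposed in this study: it assumes the visual and auditory features have already been fused into an N × C matrix (N spatial positions, C channels), interprets content‐based attention as cosine‐similarity scoring against a query vector, and interprets dot‐product channel attention as scaled dot‐product affinities between channels. The names `fused`, `query`, `spatial_attention`, and `channel_attention` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0


def spatial_attention(features, query):
    """Content-based attention over positions (assumed: cosine scoring).

    features: N position vectors of length C; query: one length-C vector.
    Returns the N position weights and the reweighted N x C features.
    """
    weights = softmax([cosine(f, query) for f in features])
    attended = [[w * x for x in f] for w, f in zip(weights, features)]
    return weights, attended


def channel_attention(features):
    """Scaled dot-product attention across channels (assumed formulation).

    Each channel is described by its responses over the N positions; every
    channel is rebuilt as a softmax-weighted mix of all channels.
    """
    n_pos = len(features)
    channels = [list(col) for col in zip(*features)]  # C x N
    scale = math.sqrt(n_pos)
    refined = []
    for c in channels:
        weights = softmax([dot(c, d) / scale for d in channels])
        refined.append(
            [sum(w * d[i] for w, d in zip(weights, channels)) for i in range(n_pos)]
        )
    return [list(row) for row in zip(*refined)]  # back to N x C


# Toy fused features: 4 positions, 3 channels (illustrative values only).
fused = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [1.0, 1.0, 0.0],
    [0.2, 0.1, 0.9],
]
query = [1.0, 0.0, 0.5]  # hypothetical query for the localization head

loc_weights, loc_features = spatial_attention(fused, query)  # localization head
cls_features = channel_attention(fused)                      # classification head
```

In this sketch the localization head emphasizes where the relevant content lies, while the classification head emphasizes which feature channels co-activate; this separation is the intuition behind using a different attention mechanism in each head.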