The identification of colored steel buildings in images is crucial for managing the construction sector, environmental protection, and sustainable urban development. Current deep learning methods for optical remote sensing images often encounter challenges such as confusion between the roof color or shape of regular buildings and colored steel structures. Additionally, common semantic segmentation networks exhibit poor generalization and inadequate boundary regularization when extracting colored steel buildings. To overcome these limitations, we utilized the metal detection and differentiation capabilities inherent in synthetic aperture radar (SAR) data to develop a network that integrates optical and SAR data. This network, employing a triple-input structure, effectively captures the unique features of colored steel buildings. We designed a multimodal hybrid attention module in the network that discerns the varying importance of each data source depending on the context. Additionally, a boundary refinement (BR) module was introduced to extract the boundaries of the colored steel buildings in a more regular manner, and a deep supervision strategy was implemented to improve the performance of the network in the colored steel building extraction task. A BR module and deep supervision strategy were also implemented to sharpen the extraction of building boundaries, thereby enhancing the network’s accuracy and adaptability. The results indicate that, compared to mainstream semantic segmentation, this method effectively enhances the precision of colored steel building detection, achieving an accuracy rate of 83.19%. This improvement marks a significant advancement in monitoring illegal constructions and supporting the sustainable development of the Beijing–Tianjin–Hebei metropolitan region.