The use of remote sensing imagery has greatly improved the efficiency of building extraction; however, accurate estimation of building height remains a significant challenge. With ongoing advances in computer vision, numerous techniques based on convolutional neural networks and Transformers have been applied to remote sensing imagery, with promising results. Nevertheless, most existing approaches estimate height directly, without exploiting the intrinsic relationship between building semantic segmentation and building height estimation. In this study, we present a unified architectural framework that integrates the two tasks. We introduce a Transformer model that systematically fuses multi-level features under semantic constraints and exploits shallow spatial-detail cues from the encoder. Our approach performs well on both tasks: the coefficient of determination (R²) for height estimation reaches 0.9671 with a root mean square error (RMSE) of 1.1733 m, and the mean intersection over union (mIoU) for building semantic segmentation reaches 0.7855. These results demonstrate that multi-task learning, coupling semantic segmentation with height estimation, improves the accuracy of height estimation.
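To make the reported figures concrete, the sketch below gives the standard definitions of the three evaluation metrics named above (R², RMSE, and mIoU). The abstract does not specify the evaluation protocol, so details such as the binary building/background class layout and the per-class averaging convention for mIoU are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted heights (metres)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mean_iou(pred_mask, true_mask, num_classes=2):
    """Mean intersection over union, averaged over classes.
    num_classes=2 assumes a binary building / background segmentation."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred_mask == c, true_mask == c).sum()
        union = np.logical_or(pred_mask == c, true_mask == c).sum()
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy usage with hypothetical height values (metres) and masks:
h_true = np.array([3.2, 12.5, 7.8, 21.0])
h_pred = np.array([3.0, 13.1, 7.5, 20.4])
print(rmse(h_true, h_pred), r2_score(h_true, h_pred))

seg_true = np.array([[0, 1], [1, 1]])
seg_pred = np.array([[0, 1], [0, 1]])
print(mean_iou(seg_pred, seg_true))
```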