Semantic segmentation of aerial images has become an indispensable part of remote sensing image understanding owing to its extensive application prospects. Achieving better segmentation requires jointly reasoning about 2D appearance and 3D information and acquiring discriminative global context. However, previous approaches require accurate elevation data (e.g., nDSM and DSM) as additional inputs to segment semantics, which severely limits their applicability. On the other hand, due to the various forms of objects in complex scenes, the global context is generally dominated by features of salient patterns (e.g., large objects) and tends to smooth inconspicuous patterns (e.g., small stuff and boundaries). In this article, a novel joint framework named Height-Embedding Context Reassembly Network (HECR-Net) is proposed. First, considering that corresponding elevation data are often unavailable while height information remains valuable, our method alleviates this data constraint by simultaneously predicting semantic labels and height maps from single aerial images, implicitly distilling height-aware embeddings. Second, we introduce a novel Context-Aware Reorganization (CAR) module to generate discriminative features with global context appropriately assigned to each local position. It benefits from both a global context aggregation module for ambiguity elimination and a local feature redistribution module for detail refinement. Third, we make full use of the learned height-aware embeddings to boost semantic segmentation performance by introducing a modality-affinitive propagation (MAP) block. Finally, without bells and whistles, the segmentation results on the ISPRS Vaihingen and Potsdam data sets demonstrate that the proposed HECR-Net achieves state-of-the-art performance.
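To make the joint prediction idea concrete, the following PyTorch-style sketch shows a shared encoder feeding two task heads, one producing semantic logits and one regressing a height map, trained with a combined loss. It is a minimal illustration under assumptions: the module names (JointSegHeightNet, seg_head, height_head), channel sizes, and the 0.1 loss weight are hypothetical and do not reproduce the authors' HECR-Net architecture (in particular, the CAR and MAP modules are omitted).

```python
# Minimal sketch (assumption): joint semantic-label and height-map prediction
# from a single aerial image via a shared encoder and two task heads.
# Module names, channel sizes, and the loss weighting are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointSegHeightNet(nn.Module):
    def __init__(self, num_classes: int = 6, width: int = 64):
        super().__init__()
        # Shared encoder whose features must serve both tasks,
        # encouraging them to carry height-aware information.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, width, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Task-specific heads: semantic logits and a single-channel height map.
        self.seg_head = nn.Conv2d(width * 2, num_classes, 1)
        self.height_head = nn.Conv2d(width * 2, 1, 1)

    def forward(self, x):
        feat = self.encoder(x)            # shared features
        seg = self.seg_head(feat)         # semantic logits
        height = self.height_head(feat)   # predicted (relative) height
        size = x.shape[-2:]
        seg = F.interpolate(seg, size=size, mode="bilinear", align_corners=False)
        height = F.interpolate(height, size=size, mode="bilinear", align_corners=False)
        return seg, height


if __name__ == "__main__":
    model = JointSegHeightNet()
    image = torch.randn(2, 3, 128, 128)          # a batch of RGB aerial tiles
    labels = torch.randint(0, 6, (2, 128, 128))  # dummy semantic labels
    ndsm = torch.randn(2, 1, 128, 128)           # dummy normalized height targets
    seg, height = model(image)
    # Joint objective: cross-entropy for segmentation plus an L1 regression
    # term for height; the 0.1 weight is an arbitrary illustrative choice.
    loss = F.cross_entropy(seg, labels) + 0.1 * F.l1_loss(height, ndsm)
    loss.backward()
    print(seg.shape, height.shape, float(loss))
```

In this setup the height branch acts as an auxiliary supervision signal at training time, so no elevation data are needed at inference, which mirrors the data-constraint argument above.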