Abstract. Fast and efficient detection and reconstruction of buildings have become essential for real-time applications such as navigation, 3D rendering, augmented reality, and 3D smart cities. This study proposes a Deep Learning (DL)-based framework for the simultaneous automatic detection, localization, and height estimation of buildings from a single aerial image. The framework is built on a Y-shaped Convolutional Neural Network (Y-Net) with one encoder and two decoders. The network takes a single RGB image as input and outputs the predicted building heights together with rooflines in three classes: eave, ridge, and hip lines. The knowledge extracted by the Y-Net (i.e., building heights and rooflines) is used for 3D reconstruction of buildings at the third Level of Detail (LoD2). The main steps of the proposed approach are data preparation, CNN training, and 3D reconstruction. The experiments use airborne data from Potsdam provided by ISPRS. For the predicted heights, the results show an average Root Mean Square Error (RMSE) of about 3.8 m and a Normalized Median Absolute Deviation (NMAD) of about 1.3 m. Moreover, the overall accuracy of the extracted rooflines is about 86%.
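The two height-error measures reported above can be illustrated with a minimal sketch. It assumes the commonly used definitions, RMSE = sqrt(mean(e^2)) and NMAD = 1.4826 * median(|e - median(e)|), where e is the per-sample difference between predicted and reference height. The abstract does not state the formulas, so these definitions are an assumption. The function names and sample values below are hypothetical and purely illustrative.

```python
import math
import statistics

# Assumption: scale factor from the common NMAD definition; it makes NMAD
# comparable to a standard deviation when errors are normally distributed.
NMAD_SCALE = 1.4826


def height_errors(predicted, reference):
    """Per-sample height differences (predicted minus reference), in metres."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return [p - r for p, r in zip(predicted, reference)]


def rmse(errors):
    """Root Mean Square Error: sensitive to large outliers."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def nmad(errors):
    """Normalized Median Absolute Deviation: robust to outliers."""
    med = statistics.median(errors)
    return NMAD_SCALE * statistics.median([abs(e - med) for e in errors])


# Hypothetical heights in metres; the last prediction is a gross outlier.
errors = height_errors([10.0, 12.0, 9.0, 30.0], [10.0, 10.0, 10.0, 10.0])
print(f"RMSE = {rmse(errors):.2f} m, NMAD = {nmad(errors):.2f} m")
```

In this toy example a single gross outlier dominates the RMSE while the NMAD stays small, which is one plausible reason the two reported values can differ as much as they do.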