The aim of this paper is to identify a suitable method for constructing a 3D city model from stereo satellite imagery. To this end, a workflow was built consisting of three main steps: (1) increasing the geometric resolution of the color images using pan-sharpening techniques; (2) identifying building footprints using deep-learning techniques; and (3) building an algorithm in a GIS (Geographic Information System) environment to extract building elevations. The method was applied to stereo imagery acquired by WorldView-2 (WV-2), a commercial Earth-observation satellite. A comparison of different pan-sharpening techniques showed that the Gram–Schmidt method produced better-quality color images than the other techniques examined; this result was supported by both visual analysis of the orthophotos and quality indices (RMSE, RASE and ERGAS). Subsequently, a deep-learning technique was applied to the pan-sharpened image in order to extract the building footprints. Performance indices (precision, recall, overall accuracy and the F1 measure) showed high accuracy in the automatic recognition of buildings. Finally, starting from the Digital Surface Model (DSM) generated from the satellite imagery, an algorithm built in the GIS environment extracted the building heights from the elevation model. In this way, it was possible to build a 3D city model quickly and precisely, with buildings represented as prismatic solids with flat roofs.
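As an illustration of the performance indices named above, the following minimal sketch computes precision, recall, overall accuracy and F1 from a predicted and a reference building mask. It uses only the standard definitions of these indices; the function names (`confusion_counts`, `building_scores`), the flattened 0/1 mask representation and the zero-division fallbacks are assumptions made for illustration, not the evaluation code used in the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Scores:
    precision: float
    recall: float
    overall_accuracy: float
    f1: float


def confusion_counts(predicted: Iterable[int],
                     reference: Iterable[int]) -> Tuple[int, int, int, int]:
    """Count TP, FP, FN, TN for two equally long binary pixel sequences.

    Assumption: 1 marks a building pixel, 0 marks background.
    """
    tp = fp = fn = tn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1      # building predicted, building in reference
        elif p:
            fp += 1      # building predicted, background in reference
        elif r:
            fn += 1      # background predicted, building in reference
        else:
            tn += 1      # background in both
    return tp, fp, fn, tn


def building_scores(tp: int, fp: int, fn: int, tn: int) -> Scores:
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    overall_accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return Scores(precision, recall, overall_accuracy, f1)


# Hypothetical toy masks, flattened to one dimension for brevity.
predicted_mask = [1, 1, 0, 0, 1, 0, 0, 0]
reference_mask = [1, 0, 0, 0, 1, 1, 0, 0]
scores = building_scores(*confusion_counts(predicted_mask, reference_mask))
```

For real rasters the same counts would typically be computed on whole 2D masks at once; the per-pixel loop here is kept only for readability.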