Building extraction and polygonization are important for urban studies such as urbanization monitoring and urban planning. Remote sensing images, especially in the RGB bands, provide sufficient semantic information for both tasks. Deep learning with Convolutional Neural Networks (CNNs) has proven successful in many fields, including building extraction from remote sensing images. In this paper, we propose a two-stage deep learning method for building polygonization from remote sensing images. In the first stage, we decompose a 2-D building footprint model into three basic geometric primitives. Leveraging stacked Multi-Branch Modules (MBMs), our proposed CNN recasts building extraction as the prediction of these three primitives. In the second stage, we propose an efficient, enhanced building polygonization and adjustment algorithm that generates the final building polygons and handles both building blocks and individual buildings. We evaluate our model on three open datasets. For building blocks, our model achieved an average precision of 62.7% and an average recall of 73.6% on the CrowdAI mapping challenge dataset; on the Urban Building Classification (UBC) dataset, which contains mainly individual buildings, it achieved 13.9% and 24.4%, respectively. On the Inria aerial image dataset, the proposed method achieved an Intersection over Union (IoU) of over 71%.
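
For readers unfamiliar with the IoU metric reported for the Inria dataset, the following is a minimal sketch of pixel-wise IoU between a predicted and a reference building mask. It is an illustration only, not the evaluation code used in this work: the function name `mask_iou`, the toy masks, and the convention that two empty masks score 1.0 are all assumptions made for the example.

```python
from typing import Sequence


def mask_iou(pred: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]) -> float:
    """Pixel-wise Intersection over Union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1  # pixel labelled building in both masks
            if p or t:
                union += 1  # pixel labelled building in at least one mask
    # Assumed convention: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical toy masks (1 = building pixel, 0 = background).
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
truth = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]

iou = mask_iou(pred, truth)  # 4 shared pixels / 6 pixels in the union
print(f"IoU = {iou:.3f}")
```

In this toy case the prediction misses one column of the reference footprint, so the intersection covers 4 pixels and the union 6, giving an IoU of about 0.667.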