We introduce a system that learns to sculpt 3D models of massive urban environments. The majority of humans live in urban environments, and detailed virtual models of cities serve applications as diverse as virtual worlds, special effects, and urban planning. Creating such 3D models manually from exemplars is time-consuming, while 3D deep learning approaches incur high memory costs. In this paper, we present a technique for training 2D neural networks to repeatedly sculpt a plane into a large-scale 3D urban environment. A GAN creates an initial coarse depth map, from which we refine 3D normals and depth using an image translation network regularized by a linear system. The networks are trained on real-world data to allow generative synthesis of meshes at scale. By sculpting from multiple viewpoints, we generate a highly detailed, concave, and water-tight 3D mesh. We show cityscapes at scales of $100 \times 1600$ meters with more than 2 million triangles, and demonstrate that our results are objectively and subjectively similar to our exemplars.