ABSTRACT:Publicly available benchmark data and metric evaluation approaches have been instrumental in enabling research to advance state of the art methods for remote sensing applications in urban 3D modeling. Most publicly available benchmark datasets have consisted of high resolution airborne imagery and lidar suitable for 3D modeling on a relatively modest scale. To enable research in larger scale 3D mapping, we have recently released a public benchmark dataset with multi-view commercial satellite imagery and metrics to compare 3D point clouds with lidar ground truth. We now define a more complete metric evaluation pipeline developed as publicly available open source software to assess semantically labeled 3D models of complex urban scenes derived from multi-view commercial satellite imagery. Evaluation metrics in our pipeline include horizontal and vertical accuracy and completeness, volumetric completeness and correctness, perceptual quality, and model simplicity. Sources of ground truth include airborne lidar and overhead imagery, and we demonstrate a semi-automated process for producing accurate ground truth shape files to characterize building footprints. We validate our current metric evaluation pipeline using 3D models produced using open source multi-view stereo methods. Data and software is made publicly available to enable further research and planned benchmarking activities.