The integration of aerial and ground images is known to be effective for enhancing the quality of 3-D reconstruction in complex urban scenarios. However, directly applying the structure-from-motion (SfM) technique for unified 3-D reconstruction with aerial and ground images is particularly difficult because of the large differences in viewpoint, scale, and appearance between the two types of images. Previous studies mainly rely on viewpoint rectification or view rendering/synthesis to improve the feature matching quality for aligning the aerial and ground models. Nevertheless, these approaches still fail to address the inherent information differences between aerial and ground images. In this article, we propose a learning-based matching framework for direct SfM with ground and aerial images. The key idea of our method is to learn pixel-wise consistent features between aerial and ground images to handle the large heterogeneity of these two types of images. Specifically, we deploy a learning-based matching framework to establish robust correspondences between the aerial and ground images. Given the high-quality feature matches, the learned feature maps are used to refine keypoint locations and to fuse featuremetric error with geometric error in bundle adjustment, both of which further improve the accuracy and completeness of the recovered 3-D scene. Extensive experiments conducted on six datasets demonstrate that the proposed method can reconstruct high-fidelity 3-D models with direct aerial-to-ground SfM, which cannot be achieved by existing methods. In addition, our method also shows outstanding performance in the subtasks of feature matching and point cloud recovery.