Multi-view stereo (MVS) is a crucial task for precise 3D reconstruction. Most recent studies tried to improve the performance of matching cost volume in MVS by designing aggregated 3D cost volumes and their regularization. This paper focuses on learning a robust feature extraction network to enhance the performance of matching costs without heavy computation in the other steps. In particular, we present a dynamic scale feature extraction network, namely, CDSFNet. It is composed of multiple novel convolution layers, each of which can select a proper patch scale for each pixel guided by the normal curvature of the image surface. As a result, CDFSNet can estimate the optimal patch scales to learn discriminative features for accurate matching computation between reference and source images. By combining the robust extracted features with an appropriate cost formulation strategy, our resulting MVS architecture can estimate depth maps more precisely. Extensive experiments showed that the proposed method outperforms other state-of-the-art methods on complex outdoor scenes. It significantly improves the completeness of reconstructed models. As a result, the method can process higher resolution inputs within faster run-time and lower memory than other MVS methods. Our source code is available at https://github.com/TruongKhang/cds-mvsnet
This study addresses an image-matching problem in challenging cases, such as large scene variations or textureless scenes. To gain robustness to such situations, most previous studies have attempted to encode the global contexts of a scene via graph neural networks or transformers. However, these contexts do not explicitly represent high-level contextual information, such as structural shapes or semantic instances; therefore, the encoded features are still not sufficiently discriminative in challenging scenes. We propose a novel image-matching method that applies a topic-modeling strategy to encode high-level contexts in images. The proposed method trains latent semantic instances called topics. It explicitly models an image as a multinomial distribution of topics, and then performs probabilistic feature matching. This approach improves the robustness of matching by focusing on the same semantic areas between the images. In addition, the inferred topics provide interpretability for matching the results, making our method explainable. Extensive experiments on outdoor and indoor datasets show that our method outperforms other state-of-the-art methods, particularly in challenging cases.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.