Recently, Convolutional Neural Network (CNN)-based deep models have been successfully applied to the task of stereo matching. In this paper, we propose a novel deep stereo matching network based on the strategies of dense feature learning and compact cost aggregation, namely DFL-CCA-Net. It consists of three modules: Dense Feature Learning (DFL), Compact Cost Aggregation (CCA), and a disparity regression module. In the DFL module, a CNN backbone with Dense Atrous Spatial Pyramid Pooling (DenseASPP) is employed to extract multi-scale deep feature maps from the left and right images, respectively. An initial 4D cost volume is then obtained by concatenating the left feature maps with their corresponding right feature maps at each disparity level. In the subsequent CCA module, each initial 3D cost volume component (i.e., the component along the left or right image feature channel dimension) is aggregated into a more compact one by applying atrous convolutions with different dilation rates. These updated 3D cost volume components are then fed into the disparity regression module, which consists of a 3D CNN with a stacked hourglass structure, to estimate the final disparity map. Comprehensive experiments on the Scene Flow, KITTI 2012, and KITTI 2015 datasets show that the 3D cost volume components produced by the proposed DFL and CCA modules generally contain more multi-scale semantic information and thus largely improve the final disparity regression accuracy. Compared with other deep stereo matching methods, DFL-CCA-Net achieves highly competitive prediction accuracy, especially in reflective regions and regions containing fine detail.
INDEX TERMS: Deep stereo matching, dense feature learning, compact cost aggregation
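The pipeline described above hinges on building a concatenation-based 4D cost volume from the left and right feature maps. Below is a minimal PyTorch sketch of that single step, assuming feature maps of shape [B, C, H, W] from a shared backbone; the function name `build_concat_cost_volume` and the choice of `max_disp` are illustrative assumptions, not code from the paper.

```python
import torch

def build_concat_cost_volume(left_feat, right_feat, max_disp):
    """Concatenate left features with disparity-shifted right features.

    left_feat, right_feat: [B, C, H, W] feature maps from a shared backbone.
    Returns a 4D cost volume of shape [B, 2C, max_disp, H, W].
    """
    b, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :c, d] = left_feat
            volume[:, c:, d] = right_feat
        else:
            # Match left pixel x with right pixel x - d; columns with no
            # valid correspondence at this disparity stay zero.
            volume[:, :c, d, :, d:] = left_feat[:, :, :, d:]
            volume[:, c:, d, :, d:] = right_feat[:, :, :, :-d]
    return volume
```

For example, feature maps of shape [2, 32, 64, 128] with max_disp = 48 yield a volume of shape [2, 64, 48, 64, 128], which a 3D CNN (such as the stacked hourglass regressor mentioned above) can then aggregate into a compact cost representation.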
Unsupervised deep learning methods have shown great success in jointly estimating camera pose and depth from monocular videos. However, previous methods mostly ignore the importance of multi-scale information, which is crucial for pose and depth estimation, especially when the motion pattern changes. This article proposes an unsupervised framework for monocular visual odometry (VO) that can model multi-scale information. The proposed method utilizes densely linked atrous convolutions to enlarge the receptive field without losing image information, and adopts a non-local self-attention mechanism to effectively model long-range dependencies. Both components model objects of different scales in the image, thereby improving the accuracy of VO, especially in rotating scenes. Extensive experiments on the KITTI dataset show that our approach is competitive with other state-of-the-art unsupervised learning-based monocular methods and comparable to supervised or model-based methods. In particular, we achieve state-of-the-art results on rotation estimation.
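The VO framework above relies on densely linked atrous convolutions to grow the receptive field without downsampling. The following is a minimal PyTorch sketch of a DenseASPP-style block in that spirit; the class name `DenseAtrousBlock`, the growth rate, and the dilation rates (3, 6, 12, 18) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseAtrousBlock(nn.Module):
    """Densely connected atrous convolutions (DenseASPP-style sketch).

    Each branch receives the input concatenated with all previous branch
    outputs, so later branches compose several dilation rates and the
    effective receptive field grows without any loss of resolution.
    """

    def __init__(self, in_channels, growth=32, dilations=(3, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        channels = in_channels
        for rate in dilations:
            self.branches.append(
                nn.Sequential(
                    nn.Conv2d(channels, growth, kernel_size=3,
                              padding=rate, dilation=rate, bias=False),
                    nn.BatchNorm2d(growth),
                    nn.ReLU(inplace=True),
                )
            )
            channels += growth  # dense connectivity: inputs accumulate

    def forward(self, x):
        features = [x]
        for branch in self.branches:
            out = branch(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

# Usage: a 64-channel feature map yields 64 + 4 * 32 = 192 output channels.
block = DenseAtrousBlock(in_channels=64)
y = block(torch.randn(1, 64, 48, 160))
```

Because each branch sees the concatenation of the input and all earlier branch outputs, a single block covers objects at widely different scales, which is the property both abstracts exploit.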