This paper presents a novel unsupervised learning framework for estimating scene depth and camera pose from video sequences, fundamental to many high-level tasks such as 3D reconstruction, visual navigation, and augmented reality. Although existing unsupervised methods have achieved promising results, their performance suffers in challenging scenes such as those with dynamic objects and occluded regions. As a result, multiple mask technologies and geometric consistency constraints are adopted in this research to mitigate their negative impacts. Firstly, multiple mask technologies are used to identify numerous outliers in the scene, which are excluded from the loss computation. In addition, the identified outliers are employed as a supervised signal to train a mask estimation network. The estimated mask is then utilized to preprocess the input to the pose estimation network, mitigating the potential adverse effects of challenging scenes on pose estimation. Furthermore, we propose geometric consistency constraints to reduce the sensitivity of illumination changes, which act as additional supervised signals to train the network. Experimental results on the KITTI dataset demonstrate that our proposed strategies can effectively enhance the model’s performance, outperforming other unsupervised methods.