Studies have shown that the observed image texture details and semantic information are of great significance for the depth estimation on the road scenes. However, there are ambiguities and inaccuracies in the boundary information of observed objects in previous methods. For this reason, we hope to design a new depth estimation method that can obtain higher accuracy and more accurate boundary information of the detected object. Based on polarized self-attention (PSA) and feature pyramid U-net, we proposed a new self-supervised monocular depth estimation model to extract more accurate texture details and semantic information. Firstly, we add a PSA module at the end of the depth encoder and pose encoder so that the network can extract more accurate semantic information. Then, based on the U-net, we put the multi-scale image obtained by the object detection module FPN (Feature Pyramid network) directly into the decoder. It can guide the model to learn semantic information, thus enhancing the boundary of the image. We evaluated our method on KITTI 2015 datasets and Make3D datasets, and our model achieved better results than previous studies. In order to verify the generalization of the model, we have done monocular, stereo, monocular plus stereo experiments. The experimental results show that our model has achieved better results in several main evaluation indexes and clearer boundary information. In order to compare different forms of PSA mechanism, we did ablation experiments. Compared with no PSA module, after adding the PSA module, better results in evaluating indicators were achieved. We also found that our model is better in monocular training than stereo training and monocular plus stereo training.