Monocular depth estimation is a fundamental problem for various vision applications and is therefore gaining increasing attention in computer vision. Although great improvements have been made thanks to the rapid progress of deep convolutional neural networks, estimating object depth at finer details remains unsatisfactory, especially in complex scenes with rich structural information. In this paper, we propose a deep end-to-end learning framework that combines multi-scale convolutions and joint attention mechanisms to tackle this challenge. Specifically, we first design a lightweight up-convolution to generate multi-scale feature maps. We then introduce an attention-based residual block that aggregates different feature maps jointly along the channel and spatial dimensions, enhancing the discriminative ability of feature fusion at finer details. Furthermore, we explore an effective adaptive weight-adjustment strategy for the loss function, which adjusts the weight of each loss term during training without additional hyperparameters, to further improve performance. The proposed framework was evaluated on the challenging NYU Depth v2 and KITTI datasets. Experimental results demonstrate that the proposed approach is superior to most state-of-the-art methods.

INDEX TERMS Monocular depth estimation, multi-scale convolutions, joint attention mechanisms, weight adjustment.