“…Different kinds of backbone, such as VGGNet, ResNet, HRNet and PackNet, made their way into the self-supervised monocular depth estimation task [71,17,69,18]. Moreover, to improve the feature extraction and processing ability, new frameworks like HR-Depth [30] and CADepth [57] also introduced attention modules. However, we argue that a shared shortcoming of existing self-supervised models falls in the reduced receptive field of Convolutional Neural Networks (CNNs).…”