“…The breakthroughs of Transformer networks [60] in the natural language processing (NLP) domain have sparked the computer vision community's interest in developing vision transformers for a range of tasks, such as image classification [10,40], object detection [4,63,6,40], image segmentation [96,54,63,40], object tracking [80,81], and pose estimation [42,58]. Among them, DPT [54] adopts a U-shaped structure and uses ViT [10] as its encoder to perform semantic segmentation and monocular depth estimation.…”