“…In recent years, vision transformers (ViTs) [1] have gradually surpassed and replaced convolutional neural networks (CNNs) and found wide application in various downstream medical imaging tasks, including segmentation [2], [3], [4], [5], classification [6], [7], [8], [9], restoration [10], [11], [12], [13], synthesis [14], [15], [16], [17], registration [18], [19], [20], [21], and object detection [22], [23]. In particular, significant progress has been observed in 3D medical image segmentation with the adoption of ViTs [24], [25], [26], [27], [28].…”