“…Recently, the Vision Transformer (ViT) [13] was proposed for large-scale image recognition; it applies a Transformer directly to images by splitting each image into patches, mapping each patch through a standard linear embedding, and feeding the resulting sequence of tokens to the Transformer. A significant body of research has since incorporated ViT as an encoder for applications such as image recognition [14], semantic segmentation [15], video understanding [16], and image processing [17]. Further, in the domain of medical imaging, researchers have explored Transformer-based architectures for medical image segmentation (TransUNet [11], TransFuse [12], SSFormer [18], etc.).…”
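
The patch-and-embed step described in the excerpt can be illustrated with a minimal NumPy sketch. It is not taken from the cited work: the function names `patchify` and `embed_patches`, the 224×224×3 input, the patch size of 16, the embedding width of 768, and the random weights are all assumptions chosen for illustration. The class token and position embeddings used by the full ViT are omitted.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    Hypothetical helper, written only to illustrate the idea.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Expose the patch grid: (grid_h, patch_h, grid_w, patch_w, C).
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    # Group the pixels of each patch together: (grid_h, grid_w, patch_h, patch_w, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(image, patch_size, weight, bias):
    """Map every flattened patch to a token with one shared linear layer."""
    return patchify(image, patch_size) @ weight + bias


# Illustrative sizes and random parameters (assumptions, not values from the source).
rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patch_size, embed_dim = 16, 768
weight = rng.standard_normal((patch_size * patch_size * 3, embed_dim)) * 0.02
bias = np.zeros(embed_dim)

tokens = embed_patches(image, patch_size, weight, bias)
print(tokens.shape)  # (196, 768): a 14 x 14 grid of patches, one token each
```

The resulting `tokens` array is the sequence a Transformer encoder would consume; in a trained model the projection weights are learned rather than sampled at random.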