“…This enables convolutional and transformer-based blocks to be interleaved in the encoder and decoder, fully exploiting their complementary strengths in feature extraction. TFCNs [64] introduce the Convolutional Linear Attention Block (CLAB), which combines two types of attention: spatial attention over the image's spatial extent and channel attention over CNN-style feature channels. However, these methods merely rearrange CNN and transformer modules within the UNet encoder-decoder structure without introducing genuinely novel components, which limits their gains in segmentation performance.…”
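The survey does not reproduce CLAB's internals, but the two attention types it names can be illustrated with a minimal sketch: a channel gate computed from a spatially pooled descriptor, followed by a spatial gate computed from a channel-pooled map. The NumPy code below is an illustrative assumption, not TFCNs' actual implementation; mean pooling and plain sigmoids stand in for CLAB's learned layers, and the name `clab_like_block` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x):
    # x: (C, H, W). Squeeze the spatial extent to one descriptor per
    # channel, then turn it into a per-channel gate in (0, 1).
    desc = x.mean(axis=(1, 2))        # (C,) global average pool
    gate = sigmoid(desc)              # stand-in for a learned MLP
    return x * gate[:, None, None]    # rescale each feature channel

def spatial_attention(x):
    # x: (C, H, W). Pool across channels to one value per position,
    # then gate every spatial location.
    desc = x.mean(axis=0)             # (H, W) channel-wise pooling
    gate = sigmoid(desc)              # stand-in for a learned conv
    return x * gate[None, :, :]       # rescale each spatial position

def clab_like_block(x):
    # Hypothetical composition: channel attention, then spatial attention.
    return spatial_attention(channel_attention(x))

feat = np.random.randn(8, 16, 16).astype(np.float32)
out = clab_like_block(feat)
print(out.shape)  # (8, 16, 16)
```

Because both gates lie in (0, 1), the block can only attenuate features; in the real module the learned MLP and convolution decide which channels and positions are suppressed.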