“…Recently, MPViT [66] explores multi-scale patch embedding and a multi-path structure that enable fine and coarse feature representations simultaneously. Benefiting from advances in basic vision transformer models, many task-specific models have been proposed and have achieved significant progress in downstream vision tasks, e.g., object detection [13,134,43,22], semantic segmentation [130,17,98,116,126,29,28], generative adversarial networks [59,102,58], low-level vision [16,72,127], video understanding [81,7,94,104,118], self-supervised learning [2,24,80,53,4,14,38,23,117,110,3,45], and neural architecture search [103,67,15,19,20]. Inspired by practical improvements in EA variants, this work transfers them to Transformer design and builds a powerful vision model with higher accuracy and efficiency than contemporary works.…”