EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Zhang, Jiangning; Li, Xiangtai; Wang, Yabiao; Wang, Chengjie; Yang, Yibo; Liu, Yong; Tao, Dacheng

doi:10.48550/arxiv.2206.09325

Cited by 5 publications

(6 citation statements)

References 68 publications

“…There are mainly three different directions for Transformer in vision: representation learning as a feature extractor, vision-language modeling, and using object query for downstream detection-related tasks. For the first aspect, ViTs [27], [58], [59], [60] have more advantages in modeling global-range relation among the image patch features. Most recent works [61], [62] combine the local CNN design with ViTs.…”

Section: Related Work (mentioning)

confidence: 99%

PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation

Li, Xu, Yang et al. 2023

Preprint

Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works utilize separated approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework named Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we make the following contributions: Firstly, we design a meta-architecture that decouples part feature and things/stuff feature, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model as Panoptic-PartFormer. Secondly, we propose a new metric Part-Whole Quality (PWQ) to better measure this task from both pixel-region and part-whole perspectives. It can also decouple the error for part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former, based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross attention scheme to further boost part segmentation quality. We design a new part-whole interaction method using masked cross attention. Finally, the extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% on GFlops and 50% on parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available at https://github.com/lxtGH/Panoptic-PartFormer.

“…For the input face image $I$, with two different masks $M_b$ and $M_o$, we can get two masked face image $I_{M_b}$ and $I_{M_o}$ as positive pair, we expect that the model can recognize they come from the same face image. We use the class token of the Vision Transformer [9], [27], [28], [29] as the identifying label and a teacher-student framework to get the predictive categorical distributions.…”

Section: Random Mask For Contrastive Learning (mentioning)

confidence: 99%

Toward High Quality Facial Representation Learning

Wang, Peng, Zhang et al. 2023

Proceedings of the 31st ACM International Conference on Multimedia

Face analysis tasks have a wide range of applications, but the universal facial representation has only been explored in a few works. In this paper, we explore high-performance pre-training methods to boost the face analysis tasks such as face alignment and face parsing. We propose a self-supervised pre-training framework, called Mask Contrastive Face (MCF), with mask image modeling and a contrastive strategy specially adjusted for face domain tasks. To improve the facial representation quality, we use the feature map of a pre-trained visual backbone as a supervision item and use a partially pre-trained decoder for mask image modeling. To handle the face identity during the pre-training stage, we further use random masks to build contrastive learning pairs. We conduct the pre-training on the LAION-FACE-cropped dataset, a variant of LAION-FACE 20M, which contains more than 20 million face images from Internet websites. For efficient pre-training, we explore our framework's pre-training performance on a small part of LAION-FACE-cropped and verify the superiority with different pre-training settings. Our model pre-trained with the full pre-training dataset outperforms the state-of-the-art methods on multiple downstream tasks. Our model achieves 0.932 NME_diag for AFLW-19 face alignment and 93.96 F1 score for LaPa face parsing. Code is available at https://github.com/nomewang/MCF.

“…Plain Vision Transformer. Since Vision Transformer (ViT) [18] first introduced Transformer [91] structure into visual classification successfully, massive improvements have been subsequently developed [92], [93], [94], [95], [96], [97], [98]. Benefiting from global dynamic modeling capabilities, columnar plain ViT offers more excellent usability and practical values compared to the more complex pyramidal structures.…”

Section: Related Work (mentioning)

confidence: 99%

“…Thanks to the global modeling capability of Multi-Head Self-Attention (MHSA), ViT can simultaneously pay attention to distant low-frequency information and close high-frequency information [97], [110]. That is what CNN, with the local modeling manner, does not have.…”

Section: Advantage Explanation Of ViT (mentioning)

confidence: 99%

Diminishing Empirical Risk Minimization for Unsupervised Anomaly Detection

Wang, Liu, Chen et al. 2022

2022 International Joint Conference on Neural Networks (IJCNN)

This work studies the recently proposed challenging and practical Multi-class Unsupervised Anomaly Detection (MUAD) task, which only requires normal images for training while simultaneously testing both normal/anomaly images for multiple classes. Existing reconstruction-based methods typically adopt pyramid networks as encoders/decoders to obtain multi-resolution features, accompanied by elaborate sub-modules with heavier handcraft engineering designs for more precise localization. In contrast, a plain Vision Transformer (ViT) with simple architecture has been shown effective in multiple domains, which is simpler, more effective, and elegant. Following this spirit, this paper explores plain ViT architecture for MUAD. Specifically, we abstract a Meta-AD concept by inducing current reconstruction-based methods. Then, we instantiate a novel and elegant plain ViT-based symmetric ViTAD structure, effectively designed step by step from three macro and four micro perspectives. In addition, this paper reveals several interesting findings for further exploration. Finally, we propose a comprehensive and fair evaluation benchmark on eight metrics for the MUAD task. Based on a naive training recipe, ViTAD achieves state-of-the-art (SoTA) results and efficiency on the MVTec AD and VisA datasets without bells and whistles, obtaining 85.4 mAD that surpasses SoTA UniAD by +3.0↑, and only requiring 1.1 hours and 2.3G GPU memory to complete model training by a single V100 GPU. Source code, models, and more results are available at github.com/zhangzjn/ADer and website.
