2021
DOI: 10.48550/arxiv.2105.02723
Preprint
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Luke Melas-Kyriazi

Abstract: The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed…
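As a rough illustration of the replacement described in the abstract, the sketch below (PyTorch-style Python, an assumption rather than the author's released code) applies a single linear layer over the patch dimension in place of self-attention; the class name and hyperparameters are illustrative only.

```python
# Minimal sketch: replace the self-attention sublayer of a ViT block with a
# feed-forward (linear) layer applied over the patch dimension.
# `FeedForwardOverPatches` and its hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class FeedForwardOverPatches(nn.Module):
    """Mixes information across patches with a single linear layer."""
    def __init__(self, num_patches: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)            # normalize each token's features
        self.fc = nn.Linear(num_patches, num_patches)   # acts on the patch (token) axis

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, num_patches, hidden_dim)
        y = self.norm(x).transpose(1, 2)     # -> (batch, hidden_dim, num_patches)
        y = self.fc(y).transpose(1, 2)       # mix across patches, then restore layout
        return x + y                         # residual connection, as in a ViT block
```

Stacking such patch-mixing blocks, interleaved with the usual per-patch feed-forward MLP of a transformer block, yields an architecture built only from feed-forward layers, in line with the abstract's description.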

Cited by 39 publications (65 citation statements)
References 7 publications
“…In this section, we review in detail the structure of the latest so-called pioneering MLP model, MLP-Mixer [36], followed by a brief review of the contemporaneous ResMLP [40] and Feed-forward [37]. After that, we strip the new paradigm, MLP, from the network and elaborate on its differences from and connections with convolution and self-attention mechanisms.…”
Section: Pioneering Model and New Paradigm
confidence: 99%
“…Here σ is an element-wise nonlinearity (GELU [79]), and LayerNorm(·) denotes the layer normalization [48] widely used in Transformer-based models. W₃ ∈ ℝ^{rC×C} represents the weights of a fully-connected layer that increases the feature dimension. Compared to MLP-Mixer, Feed-forward (FF) [37] and ResMLP [40] were posted on arXiv a few days later. Feed-forward [37] adopts essentially the same structure as MLP-Mixer, merely swapping the order of the channel-mixing MLP and the token-mixing MLP, so it is not repeated here.…”
Section: Structure of Pioneering Model
confidence: 99%
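For concreteness, the following sketch shows a channel-mixing MLP of the kind the quoted passage describes, with LayerNorm, a GELU nonlinearity σ, and an expansion weight W₃ of shape (rC, C); the class name and the default ratio r = 4 are assumptions for illustration, not values taken from the cited papers.

```python
# Minimal PyTorch sketch of the channel-mixing MLP from the quoted passage:
# LayerNorm [48], a GELU nonlinearity (sigma), and a weight W3 of shape (r*C, C)
# that expands the channel dimension. Class name and r = 4 are assumptions.
import torch
import torch.nn as nn

class ChannelMixingMLP(nn.Module):
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.expand = nn.Linear(channels, r * channels)   # weight W3 in R^{rC x C}
        self.act = nn.GELU()                              # element-wise sigma
        self.reduce = nn.Linear(r * channels, channels)   # project back to C channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, channels); the MLP acts on each patch independently
        return x + self.reduce(self.act(self.expand(self.norm(x))))
```

Swapping the order in which this channel-mixing block and a patch-mixing block (as sketched earlier) are applied is the only structural difference the passage notes between Feed-forward [37] and MLP-Mixer [36].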