2021
DOI: 10.48550/arxiv.2106.06295
Preprint

Going Beyond Linear Transformers with Recurrent Fast Weight Programmers

Abstract: Transformers with linearised attention ("linear Transformers") have demonstrated the practical scalability and effectiveness of outer product-based Fast Weight Programmers (FWPs) from the '90s. However, the original FWP formulation is more general than that of linear Transformers: a slow neural network (NN) continually reprograms the weights of a fast NN with arbitrary NN architectures. In existing linear Transformers, both NNs are feedforward and consist of a single layer. Here we explore new variations by…
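
The outer product-based fast weight update mentioned in the abstract is the core mechanism shared by linear Transformers and FWPs: a slow network emits keys, values, and queries, and the fast weight matrix is reprogrammed by a rank-1 update at each step. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the ELU+1 feature map and the names `phi` and `fast_weight_step` are illustrative assumptions.

```python
import numpy as np

def phi(x):
    # A common kernel feature map for linearised attention: ELU(x) + 1,
    # which keeps the features non-negative (an assumption for this sketch).
    return np.where(x > 0, x + 1.0, np.exp(x))

def fast_weight_step(W, k, v, q):
    # The "slow" network produces key k, value v, and query q for this step.
    # The fast weight matrix W is reprogrammed by an outer product (write),
    # then applied to the query (read), replacing softmax attention over
    # the whole history with constant-size state.
    W = W + np.outer(v, phi(k))   # rank-1 update of the fast net
    y = W @ phi(q)                # query the fast net
    return W, y

# Toy usage: process a sequence step by step with constant memory.
d_k, d_v, T = 8, 8, 5
rng = np.random.default_rng(0)
W = np.zeros((d_v, d_k))
for t in range(T):
    k, v, q = rng.normal(size=d_k), rng.normal(size=d_v), rng.normal(size=d_k)
    W, y = fast_weight_step(W, k, v, q)
print(y.shape)  # (8,)
```

The point of the sketch is the constant-size recurrent state W: unlike softmax attention, nothing grows with sequence length, which is what makes the FWP view attractive for the recurrent extensions the paper proposes.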

Cited by 3 publications (3 citation statements)
References 42 publications
“…Parisotto et al (2020) address the problem of using transformers in RL and show that adding gating layers on top of the transformer layers can stabilize training. Subsequent works addressed the increased computational load of using a transformer for an agent's policy (Irie et al, 2021; Parisotto & Salakhutdinov, 2021). Chen et al (2021) and Janner et al (2021) take a different approach by framing RL as a sequence modeling problem and use a transformer to predict actions without additional networks for an actor or critic.…”
Section: Related Work (mentioning)
confidence: 99%
“…Liu et al propose a solution to the vanishing gradient problem in [51]. However, both networks are very complex and require a long training time to become effective [52].…”
Section: The Second Strategy: Distilbert Language Model (Transformers… (mentioning)
confidence: 99%
“…Since then, applications of DNNs have increased dramatically, driven by advances in graphics processing units (GPUs) and a significant boost in computing power. From 2009 to 2012, Jurgen Schmidhuber of IDSIA, a Swiss AI laboratory, developed feedforward neural networks (FNNs) [9], [10]. Hinton et al won ImageNet 2012 [11], outperforming the second-place competitors in image classification precision and thereby leading to the current deep learning boom.…”
Section: Introduction (mentioning)
confidence: 99%