We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
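As a concrete illustration of the estimator described above, the following minimal NumPy sketch builds the positive random features that FAVOR+ uses to approximate the softmax kernel exp(q·k/√d) and evaluates attention by multiplying φ(K)ᵀV before φ(Q), so the L×L attention matrix is never formed. Function names are ours, the projection rows are plain Gaussian (the orthogonalization and numerical stabilizers used in practice are omitted), and this is a sketch of the mechanism rather than a reference implementation.

```python
import numpy as np

def positive_random_features(x, proj):
    # FAVOR+ feature map: phi(x) = exp(w·x - ||x||^2 / 2) / sqrt(m) for each row w of proj.
    # With w ~ N(0, I), E[phi(q)·phi(k)] = exp(q·k), and every feature is positive,
    # which keeps the softmax estimate well behaved where attention weights are small.
    m = proj.shape[0]
    return np.exp(x @ proj.T - np.sum(x ** 2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

def favor_plus_attention(Q, K, V, num_features=256, seed=0):
    # Approximate softmax attention in O(L * m * d): compute phi(K)^T V first,
    # so no L x L matrix is ever materialized.
    rng = np.random.default_rng(seed)
    d = Q.shape[-1]
    proj = rng.standard_normal((num_features, d))       # orthogonalization of rows omitted here
    Qp = positive_random_features(Q / d ** 0.25, proj)  # 1/sqrt(d) temperature folded into inputs
    Kp = positive_random_features(K / d ** 0.25, proj)
    numer = Qp @ (Kp.T @ V)                             # (L, d_v), linear in sequence length L
    denom = Qp @ Kp.sum(axis=0)                         # approximate softmax row normalizers
    return numer / denom[:, None]
```

On short sequences this can be checked against exact softmax attention; the approximation tightens as num_features grows.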
Transformer models have achieved state-of-the-art results across a diverse range of domains. However, concern over the cost of training the attention mechanism to learn complex dependencies between distant inputs continues to grow. In response, solutions that exploit the structure and sparsity of the learned attention matrix have blossomed. However, in real-world applications that involve long sequences, such as biological sequence analysis, these assumptions may not hold, precluding the use of these models. To address this challenge, we present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR). Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors. Furthermore, it provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence. It is also backwards-compatible with pre-trained regular Transformers. We demonstrate its effectiveness on the challenging task of protein sequence modeling and provide detailed theoretical analysis.
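To make the linear scaling concrete in the autoregressive setting used for protein language modeling, here is a hedged sketch of causal kernelized attention with running prefix sums. It reuses the positive feature map sketched above; the Python loop is purely illustrative (practical implementations vectorize it or use fused kernels), and the names are ours, not the paper's API.

```python
import numpy as np

def causal_linear_attention(Qp, Kp, V):
    # Qp, Kp: (L, m) feature maps of queries and keys (e.g. positive_random_features above).
    # Token i attends only to tokens j <= i. Running sums over phi(k_j) v_j^T and phi(k_j)
    # give O(L * m * d_v) time and O(m * d_v) extra memory instead of O(L^2) attention scores.
    L, m = Qp.shape
    d_v = V.shape[-1]
    kv_state = np.zeros((m, d_v))   # running sum_j phi(k_j) v_j^T
    k_state = np.zeros(m)           # running sum_j phi(k_j)
    out = np.empty((L, d_v))
    for i in range(L):
        kv_state += np.outer(Kp[i], V[i])
        k_state += Kp[i]
        out[i] = (Qp[i] @ kv_state) / (Qp[i] @ k_state)
    return out
```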
Reparameterization of variational auto-encoders with continuous random variables is an effective method for reducing the variance of their gradient estimates. Our work optimizes the discrete VAE objective directly, using its Gumbel-Max reparameterization, by applying the direct loss minimization technique to generative models. This optimization technique propagates gradients through the reparameterized arg max, which are estimated by the difference of gradients of two arg max predictions. This realization provides the means to learn latent representations in cases when evaluating the arg max operation is tractable while evaluating the softmax operation is intractable.
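Reading the abstract literally, the estimator amounts to: sample Gumbel noise once, take two arg max predictions (with and without the loss folded into the scores, scaled by a small ε), and use the difference of their gradients with respect to the parameters; for a categorical with directly parameterized logits those gradients are one-hot vectors. The toy NumPy sketch below is our illustration under that reading, with hypothetical names and a per-category loss standing in for the decoder term of a discrete VAE.

```python
import numpy as np

def direct_grad_estimate(theta, loss_per_category, eps=0.5, num_samples=5000, seed=0):
    # Estimate d/d theta of E_{z ~ softmax(theta)}[loss(z)] via the Gumbel-Max
    # reparameterization: z = argmax_i (theta_i + gamma_i), gamma_i ~ Gumbel(0, 1).
    # The gradient is approximated by the difference of the gradients of two arg max
    # predictions, one with the loss added to the scores and one without; for directly
    # parameterized logits each such gradient is a one-hot vector.
    # Larger eps lowers variance but adds bias; the estimate becomes exact as eps -> 0.
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        gamma = rng.gumbel(size=theta.shape)
        z_plain = np.argmax(theta + gamma)                            # sample from softmax(theta)
        z_shift = np.argmax(theta + gamma + eps * loss_per_category)  # loss-perturbed prediction
        grad[z_shift] += 1.0
        grad[z_plain] -= 1.0
    return grad / (eps * num_samples)

# Sanity check against the exact gradient of E[loss] for a small categorical.
theta = np.array([0.5, -0.3, 0.1])
loss = np.array([1.0, 0.0, 2.0])
p = np.exp(theta) / np.exp(theta).sum()
exact = p * (loss - p @ loss)          # d/d theta_i of sum_j p_j * loss_j
print(direct_grad_estimate(theta, loss), exact)
```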
Perturbation models are families of distributions induced from perturbations. They combine randomization of the parameters with maximization to draw unbiased samples. Unlike Gibbs' distributions, a perturbation model defined on the basis of low order statistics still gives rise to high order dependencies. In this paper, we analyze, extend and seek to estimate such dependencies from data. In particular, we shift the modelling focus from the parameters of the Gibbs' distribution used as a base model to the space of perturbations. We estimate dependent perturbations over the parameters using a hard-EM approach, cast in the form of inverse convex programs. Each inverse program confines the randomization to the parameter polytope responsible for generating the observed answer. We illustrate the method on several computer vision problems.
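As a toy illustration of the claim that randomizing only low-order statistics still yields high-order dependencies, here is a perturb-and-MAP sketch on a small binary chain MRF: only the unary parameters are perturbed with Gumbel noise, yet the pairwise coupling propagates through the maximization and distant variables come out correlated. The model, names, and brute-force MAP solver are ours for illustration only; the paper's hard-EM estimation via inverse convex programs is not reproduced here.

```python
import numpy as np
from itertools import product

def map_assignment(unary, coupling):
    # Brute-force MAP for a small binary chain MRF:
    # score(x) = sum_i unary[i, x_i] + sum_i coupling * [x_i == x_{i+1}].
    n = unary.shape[0]
    best_x, best_score = None, -np.inf
    for x in product((0, 1), repeat=n):
        score = sum(unary[i, x[i]] for i in range(n))
        score += sum(coupling * (x[i] == x[i + 1]) for i in range(n - 1))
        if score > best_score:
            best_x, best_score = x, score
    return np.array(best_x)

def perturb_and_map_sample(unary, coupling, rng):
    # Randomize only the low-order (unary) parameters, then maximize. The pairwise
    # coupling enters through the arg max, so the samples carry dependencies of
    # higher order than the perturbed statistics.
    return map_assignment(unary + rng.gumbel(size=unary.shape), coupling)

rng = np.random.default_rng(0)
unary = np.zeros((5, 2))  # neutral unary potentials
samples = np.stack([perturb_and_map_sample(unary, coupling=2.0, rng=rng) for _ in range(2000)])
print(np.corrcoef(samples[:, 0], samples[:, 4])[0, 1])  # end-of-chain variables remain correlated
```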