2021
DOI: 10.48550/arxiv.2103.07601
Preprint

Approximating How Single Head Attention Learns

Abstract: Why do models often attend to salient words, and how does this evolve throughout training? We approximate model training as a two-stage process: early in training, when the attention weights are uniform, the model learns to translate an individual input word i to an output word o if they co-occur frequently. Later, the model learns to attend to i while the correct output is o because it knows that i translates to o. To formalize this, we define a model property, Knowledge to Translate Individual Words (KTIW) (e.g., knowing that i translates to o) …
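To make the two-stage picture concrete, here is a minimal numpy sketch, not the authors' actual model or training setup: a word-to-word logit table W stands in for KTIW, and the toy corpus, learning rate, and sharpening temperature are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V = 4                  # toy vocabulary size (assumption)
W = np.zeros((V, V))   # KTIW stand-in: logit that source word i translates to o

# Toy corpus of (source sentence, correct output word) pairs; each output
# word co-occurs most frequently with the source word that translates to it
# (here, word i translates to output i).
corpus = [([0, 2, 3], 0), ([0, 2, 3], 2), ([0, 2, 3], 3),
          ([1, 2], 1), ([1, 2], 2), ([0, 1], 0), ([0, 1], 1)]

# Stage 1: attention is uniform, so the cross-entropy gradient updates every
# attended word's row equally -- the model learns word-level translation
# (KTIW) from co-occurrence statistics alone.
for _ in range(200):
    for src, tgt in corpus:
        attn = np.full(len(src), 1.0 / len(src))  # uniform attention
        probs = softmax(attn @ W[src])            # predicted output distribution
        grad = probs.copy()
        grad[tgt] -= 1.0                          # dL/dlogits for cross-entropy
        W[src] -= 0.5 * np.outer(attn, grad)      # each attended word shares the update

# Stage 2: with KTIW in place, attention can sharpen toward the source word
# whose learned translation matches the correct output.
src, tgt = [0, 2, 3], 2
attn = softmax(5.0 * W[src, tgt])  # temperature 5.0 is an assumption
print(np.round(attn, 3))           # peaks on source word 2
```

In the abstract's framing, stage 1 works precisely because uniform attention reduces the loss to a bag-of-words translation problem; stage 2 then bootstraps the attention weights from the KTIW learned in stage 1.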

Cited by 3 publications (3 citation statements)
References 16 publications

“…The role of attention in Transformers was studied by [WCM21, DGV+18]. In terms of optimization, [ZKV+20] examined the impact of adaptive approaches on attention models, while [SZKS21] analyzed the dynamics of single-head attention to approximate Seq2Seq architecture's learning process. For most LLMs, it generally suffices to conduct attention computations in an approximate manner during the inference process, provided that there are adequate assurances of accuracy.…”
Section: Algorithmic Regularization
confidence: 99%
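The last sentence of the statement above alludes to approximate attention computation at inference time. As a hedged illustration only, not a method from the cited works, here is a minimal top-k sparse-attention sketch for a single query; the function name, the choice of k, and the shapes are illustrative assumptions.

```python
import numpy as np

def topk_attention(q, K, V, k=4):
    """Approximate single-query attention: keep only the k largest dot-product
    scores and renormalize. A common sparsification heuristic; name and
    parameters here are hypothetical, not from the cited papers."""
    scores = K @ q / np.sqrt(q.shape[0])    # (n,) scaled dot-product scores
    idx = np.argpartition(scores, -k)[-k:]  # indices of the top-k scores
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                            # softmax over the kept scores only
    return w @ V[idx]                       # weighted sum of k value rows

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(64, 8))
V = rng.normal(size=(64, 8))
print(topk_attention(q, K, V))
```

Dropping small scores bounds the softmax error when the discarded scores are well below the kept ones, which is one kind of accuracy assurance the quoted passage refers to.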
“…Transformers. There is a long line of work investigating the capabilities [Vaswani et al., 2017, Dehghani et al., 2018, Yun et al., 2019, Pérez et al., 2019, Yao et al., 2021, Bhattamishra et al., 2020b, Zhang et al., 2022], limitations [Hahn, 2020, Bhattamishra et al., 2020a], applications [Lu et al., 2021a, Dosovitskiy et al., 2020, Parmar et al., 2018], and internal workings [Elhage et al., 2021, Snell et al., 2021, Weiss et al., 2021, Edelman et al., 2022, Olsson et al., 2022] of Transformer models. Most similar to our work, Müller et al. [2021] introduce a "Prior-data fitted transformer network" that is trained to approximate Bayesian inference and generate predictions for downstream learning problems.…”
Section: Related Work
confidence: 99%
“…Optimization and Convergence. In the realm of optimization, [SZKS21] concentrated on investigating the behavior of a single-head attention mechanism to emulate the process of learning a Seq2Seq model, while adaptive methods have been emphasized for attention models by [ZKV+20].…”
Section: Transformer Theory
confidence: 99%