2021
DOI: 10.48550/arxiv.2103.07601
Preprint

Approximating How Single Head Attention Learns

Abstract: Why do models often attend to salient words, and how does this evolve throughout training? We approximate model training as a two-stage process: early in training, when the attention weights are uniform, the model learns to translate an individual input word i to an output word o if they co-occur frequently. Later, the model learns to attend to i while the correct output is o because it knows that i translates to o. To formalize this, we define a model property, Knowledge to Translate Individual Words (KTIW) (e.g., knowing that i translates to o) …
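To make the two-stage picture concrete, here is a minimal numpy sketch, not the authors' actual model or training setup: a word-to-word logit table W stands in for KTIW, and the toy corpus, learning rate, and sharpening temperature are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V = 4                  # toy vocabulary size (assumption)
W = np.zeros((V, V))   # KTIW stand-in: logit that source word i translates to o

# Toy corpus of (source sentence, correct output word) pairs; each output
# word co-occurs most frequently with the source word that translates to it
# (here, word i translates to output i).
corpus = [([0, 2, 3], 0), ([0, 2, 3], 2), ([0, 2, 3], 3),
          ([1, 2], 1), ([1, 2], 2), ([0, 1], 0), ([0, 1], 1)]

# Stage 1: attention is uniform, so the cross-entropy gradient updates every
# attended word's row equally -- the model learns word-level translation
# (KTIW) from co-occurrence statistics alone.
for _ in range(200):
    for src, tgt in corpus:
        attn = np.full(len(src), 1.0 / len(src))  # uniform attention
        probs = softmax(attn @ W[src])            # predicted output distribution
        grad = probs.copy()
        grad[tgt] -= 1.0                          # dL/dlogits for cross-entropy
        W[src] -= 0.5 * np.outer(attn, grad)      # each attended word shares the update

# Stage 2: with KTIW in place, attention can sharpen toward the source word
# whose learned translation matches the correct output.
src, tgt = [0, 2, 3], 2
attn = softmax(5.0 * W[src, tgt])  # temperature 5.0 is an assumption
print(np.round(attn, 3))           # peaks on source word 2
```

In the abstract's framing, stage 1 works precisely because uniform attention reduces the loss to a bag-of-words translation problem; stage 2 then bootstraps the attention weights from the KTIW learned in stage 1.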

Cited by 3 publications (3 citation statements)
References 16 publications

“…The role of attention in Transformers was studied by [WCM21, DGV+18]. In terms of optimization, [ZKV+20] examined the impact of adaptive approaches on attention models, while [SZKS21] analyzed the dynamics of single-head attention to approximate Seq2Seq architecture's learning process. For most LLMs, it generally suffices to conduct attention computations in an approximate manner during the inference process, provided that there are adequate assurances of accuracy.…”
Section: Algorithmic Regularization
confidence: 99%
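The last sentence of the statement above alludes to approximate attention computation at inference time. As a hedged illustration only, not a method from the cited works, here is a minimal top-k sparse-attention sketch for a single query; the function name, the choice of k, and the shapes are illustrative assumptions.

```python
import numpy as np

def topk_attention(q, K, V, k=4):
    """Approximate single-query attention: keep only the k largest dot-product
    scores and renormalize. A common sparsification heuristic; name and
    parameters here are hypothetical, not from the cited papers."""
    scores = K @ q / np.sqrt(q.shape[0])    # (n,) scaled dot-product scores
    idx = np.argpartition(scores, -k)[-k:]  # indices of the top-k scores
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                            # softmax over the kept scores only
    return w @ V[idx]                       # weighted sum of k value rows

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(64, 8))
V = rng.normal(size=(64, 8))
print(topk_attention(q, K, V))
```

Dropping small scores bounds the softmax error when the discarded scores are well below the kept ones, which is one kind of accuracy assurance the quoted passage refers to.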
“…Transformers. There is a long line of work investigating the capabilities [Vaswani et al., 2017, Dehghani et al., 2018, Yun et al., 2019, Pérez et al., 2019, Yao et al., 2021, Bhattamishra et al., 2020b, Zhang et al., 2022], limitations [Hahn, 2020, Bhattamishra et al., 2020a], applications [Lu et al., 2021a, Dosovitskiy et al., 2020, Parmar et al., 2018], and internal workings [Elhage et al., 2021, Snell et al., 2021, Weiss et al., 2021, Edelman et al., 2022, Olsson et al., 2022] of Transformer models. Most similar to our work, Müller et al. [2021] introduce a "Prior-data fitted transformer network" that is trained to approximate Bayesian inference and generate predictions for downstream learning problems.…”
Section: Related Work
confidence: 99%
“…Optimization and Convergence. In the realm of optimization, [SZKS21] concentrated on investigating the behavior of a single-head attention mechanism to emulate the process of learning a Seq2Seq model, while adaptive methods have been emphasized for attention models by [ZKV+20].…”
Section: Transformer Theory
confidence: 99%