2022
DOI: 10.48550/arxiv.2203.04212
Preprint

Measuring the Mixing of Contextual Information in the Transformer

Abstract: The Transformer architecture aggregates input information through the self-attention mechanism, but there is no clear understanding of how this information is mixed across the entire model. Additionally, recent works have demonstrated that attention weights alone are not enough to describe the flow of information. In this paper, we consider the whole attention block (multi-head attention, residual connection, and layer normalization) and define a metric to measure token-to-token interactions within each layer, …

Cited by 2 publications
(6 citation statements)
References 18 publications
“…with the residual connection $x_i$ only considered in the transformed vector $T_i(x_{j=i})$. Ferrando et al (2022) propose to use the Manhattan distance between the output vector and the transformed vector as a measure of the impact of $x_j$ on $x_i$:…”
Section: Aggregation Of Layer-wise Token-to-token Interactions (ALTI)
Type: mentioning
confidence: 99%
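
The Manhattan-distance idea in the statement above can be illustrated with a small sketch. The exact formula is elided in the quoted text, so the following is not the authors' definition. It assumes that the output vector for position i is the sum of the transformed vectors T_i(x_j), and the step that turns distances into normalized scores (clipping at zero and dividing by the sum) is an assumption added for illustration. All function and variable names are hypothetical.

```python
def l1_distance(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def manhattan_contributions(transformed):
    """Score how much each transformed vector T_i(x_j) contributes to y_i.

    transformed[j] is a hypothetical transformed vector T_i(x_j) for one
    output position i. Assumption: y_i is the sum of these vectors.
    """
    dim = len(transformed[0])
    # Output vector y_i, assumed to be the sum of the transformed vectors.
    y = [sum(t[d] for t in transformed) for d in range(dim)]
    y_norm = sum(abs(a) for a in y)

    # L1 distance between the output vector and each transformed vector.
    distances = [l1_distance(y, t) for t in transformed]

    # Assumption: a smaller distance means a larger contribution. Scores are
    # clipped at zero and normalized so that they sum to 1.
    raw = [max(0.0, y_norm - d) for d in distances]
    total = sum(raw)
    contributions = [r / total if total > 0 else 0.0 for r in raw]
    return distances, contributions


# Toy example: three transformed vectors in two dimensions.
transformed = [[2.0, 0.0], [1.0, 1.0], [0.0, -0.5]]
distances, contributions = manhattan_contributions(transformed)
print(distances)
print(contributions)
```

With these toy vectors, the transformed vector closest to the output in L1 distance receives the largest score, and the one whose distance exceeds the L1 norm of the output is clipped to zero.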
“…where each row in $C_{x_i \leftarrow x}$ contains the contribution, or influence, of each $x_j$ in $x_i$, i.e., the contribution of token representation j to token representation i. ALTI method (Ferrando et al, 2022) follows the Transformer's modeling approach proposed by Abnar and Zuidema (2020), where the information flow in the model is simplified as a Directed Acyclic Graph, where nodes are token representations, and edges represent the influence of each input layer token $x_j$ in the output token $x_i$. ALTI proposes using token contributions C instead of raw attention weights α.…”
Section: Aggregation Of Layer-wise Token-to-token Interactions (ALTI)
Type: mentioning
confidence: 99%
“…In the computer vision literature, Chefer et al (2021b,a) combined this method with gradient information. Recently, Ferrando et al (2022) have presented ALTI (Aggregation of Layer-wise Tokens Attributions), which applies the attention rollout method by substituting attention weights with refined token-to-token interactions. In this work, we present the first application of a rollout-based method to sequence to sequence Transformers.…”
Section: Introduction
Type: mentioning
confidence: 99%
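
The rollout-style aggregation mentioned in the statements above can be sketched as a product of per-layer contribution matrices. This is a minimal illustration, not the authors' implementation. It assumes each layer is summarized by a row-stochastic matrix whose entry [i][j] is the contribution of input token j to output token i, and that no extra identity term is added for the residual connection because the contributions are taken to account for it already. The names `rollout`, `layer1`, and `layer2` are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def rollout(layer_contributions):
    """Aggregate layer-wise contributions from the first layer to the last.

    layer_contributions[l][i][j] is the assumed contribution of token j at
    the input of layer l to token i at its output. The result maps the
    model's input tokens to the last layer's output tokens.
    """
    result = layer_contributions[0]
    for contributions in layer_contributions[1:]:
        # Later layers are applied on the left: C_total = C_L ... C_2 C_1.
        result = matmul(contributions, result)
    return result


# Toy example with two tokens and two layers; each row sums to 1.
layer1 = [[0.5, 0.5], [0.0, 1.0]]
layer2 = [[1.0, 0.0], [0.5, 0.5]]
total = rollout([layer1, layer2])
print(total)  # [[0.5, 0.5], [0.25, 0.75]]
```

Because each per-layer matrix is row-stochastic in this sketch, the aggregated matrix is too, so each row can be read as a distribution over input tokens.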