“…Concurrently, encoder-based Transformers, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), have been analysed with attention rollout (Abnar and Zuidema, 2020), which models the information flow in the model as a Directed Acyclic Graph, where nodes are token representations and edges are attention weights. Recently, Ferrando et al. (2022) presented ALTI (Aggregation of Layer-wise Token-to-token Interactions), which applies the attention rollout method by substituting attention weights with refined token-to-token interactions. In this work, we present the first application of a rollout-based method to encoder-decoder Transformers.…”
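As a concrete illustration of the rollout recursion described above, the following is a minimal NumPy sketch; the function name and the input layout (a list of per-layer, head-wise attention maps) are illustrative assumptions, while the 0.5/0.5 residual mixing follows Abnar and Zuidema (2020).

```python
import numpy as np

def attention_rollout(attentions):
    """Compose per-layer attention maps into token-to-token scores.

    attentions: list of L arrays of shape (heads, seq, seq),
    one per layer, ordered from bottom to top.
    Returns a (seq, seq) array whose entry (i, j) approximates the
    contribution of input token j to the top-layer representation i.
    """
    rollout = None
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)              # average attention over heads
        a = 0.5 * a + 0.5 * np.eye(a.shape[-1])  # mix in the residual connection
        a = a / a.sum(axis=-1, keepdims=True)    # re-normalize rows
        rollout = a if rollout is None else a @ rollout  # compose with lower layers
    return rollout
```

Because the recursion merely composes one matrix per layer, a method like ALTI can reuse it unchanged by passing its refined token-to-token interaction matrices in place of the raw attention maps.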