Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.19
ETC: Encoding Long and Structured Inputs in Transformers

Abstract: Transformer models have advanced the state of the art in many Natural Language Processing (NLP) tasks. In this paper, we present a new Transformer architecture, Extended Transformer Construction (ETC), that addresses two key challenges of standard Transformer architectures, namely scaling input length and encoding structured inputs. To scale attention to longer inputs, we introduce a novel global-local attention mechanism between global tokens and regular input tokens. We also show that combining global-local attention …
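The global-local attention described in the abstract can be pictured as a block-structured sparsity pattern: global tokens attend everywhere, while long-input tokens attend to the global tokens and to a fixed-radius local window of neighbours. The NumPy sketch below only illustrates that mask pattern; the function and parameter names (`etc_attention_mask`, `n_global`, `n_long`, `local_radius`) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def etc_attention_mask(n_global: int, n_long: int, local_radius: int) -> np.ndarray:
    """Boolean mask sketching an ETC-style global-local attention pattern.

    Rows are queries, columns are keys, ordered [global tokens, long tokens].
    True means "this query may attend to this key".
    """
    n = n_global + n_long
    mask = np.zeros((n, n), dtype=bool)

    # Global tokens attend to every token (global-to-global and global-to-long).
    mask[:n_global, :] = True

    # Long tokens attend to all global tokens (long-to-global).
    mask[n_global:, :n_global] = True

    # Long tokens attend to other long tokens within a fixed local radius
    # (long-to-long), so this part grows linearly with the input length.
    for i in range(n_long):
        lo = max(0, i - local_radius)
        hi = min(n_long, i + local_radius + 1)
        mask[n_global + i, n_global + lo:n_global + hi] = True

    return mask

# Toy example: 4 global tokens, 16 long tokens, local radius 2.
m = etc_attention_mask(n_global=4, n_long=16, local_radius=2)
print(m.sum(), "attended pairs vs", m.size, "for full attention")
```

In this toy setting the mask keeps 218 of the 400 query-key pairs; the gap widens as the long input grows, because only the small global block retains full attention while the long-to-long part stays linear in sequence length.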

Cited by 191 publications (162 citation statements)
References 43 publications
“…, $e^i_{|s_i|}\}$ where $e^i_j = e(w^i_j) + p^{\text{token}}_j$; $e(w^i_j)$ and $p^{\text{token}}_j$ are the token and positional embeddings of token $w^i_j$, respectively. … (Devlin et al., 2019), HiBERT and ETC (Ainslie et al., 2020).…”
Section: Stepwise HiBERT
confidence: 99%
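The quoted formula simply sums a token embedding and a positional embedding at each position. Below is a minimal numeric sketch of that sum; the array names and sizes (`token_embedding`, `position_embedding`, `d_model`) are made up for illustration and do not come from the cited models.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, max_len, d_model = 1000, 128, 64
token_embedding = rng.normal(size=(vocab_size, d_model))     # e(.)
position_embedding = rng.normal(size=(max_len, d_model))     # p^token

def embed_sentence(word_ids: list[int]) -> np.ndarray:
    """e^i_j = e(w^i_j) + p^token_j: token embedding plus positional embedding."""
    return np.stack([
        token_embedding[w] + position_embedding[j]
        for j, w in enumerate(word_ids)
    ])

sentence = [5, 42, 7, 99]        # toy word ids w^i_1 .. w^i_4
e = embed_sentence(sentence)     # shape (4, d_model)
print(e.shape)
```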
“…However, the main disadvantage of this approach is that token-level attention across sentences is prohibited, and long-range attention only happens indirectly in the second-stage encoder (see the middle diagram in Figure 1). Recently, the Extended Transformer Construction (ETC; Ainslie et al., 2020) was proposed as an alternative. It alleviates the quadratic memory growth by introducing sparsity into the attention pattern via its novel global-local attention mechanism (see the rightmost diagram in Figure 1).…”
Section: Stepwise ETCSum
confidence: 99%
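To make the "quadratic memory growth" point concrete, the back-of-the-envelope sketch below counts attention score pairs for dense self-attention versus an ETC-style global-local pattern. The sequence length, global-token count, and local radius are illustrative values chosen here, not the paper's configuration.

```python
def full_attention_pairs(n: int) -> int:
    """Dense self-attention: the score matrix grows quadratically with length."""
    return n * n

def global_local_pairs(n_long: int, n_global: int, local_radius: int) -> int:
    """Approximate score count under ETC-style global-local sparsity:
    global-to-all, long-to-global, and long-to-local-window terms."""
    g2all = n_global * (n_global + n_long)
    l2g = n_long * n_global
    l2l = n_long * (2 * local_radius + 1)
    return g2all + l2g + l2l

n_long, n_global, radius = 4096, 128, 84
print(full_attention_pairs(n_long + n_global))       # ~17.8M score pairs
print(global_local_pairs(n_long, n_global, radius))  # ~1.8M score pairs
```

With these illustrative numbers the dense pattern needs roughly ten times as many score entries, and the sparse variant's cost grows only linearly in the long-input length.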