2019
DOI: 10.48550/arxiv.1911.04070
Preprint

BP-Transformer: Modelling Long-Range Context via Binary Partitioning

Zihao Ye,
Qipeng Guo,
Quan Gan
et al.

Abstract: The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application to long text. In this paper, adopting a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), we propose BP-Transformer (BPT for short). BPT yields O(k · n log(n/k)) connections, where k is a hyperparameter that controls the density of attention. BPT strikes a good balance between computation complexity and model capacity. A series…
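The connection count in the abstract follows from how many spans a single token attends to. The following is a minimal sketch in Python, not the authors' implementation; the function name bpt_context and its arguments are illustrative assumptions. For one query position it enumerates k spans per side at each scale, doubling the span width as distance grows, which gives roughly 2k · log2(n/k) spans per token and hence O(k · n log(n/k)) connections overall.

```python
def bpt_context(q, n, k):
    """Illustrative sketch (not the paper's code): list the multi-scale spans
    a query at position q attends to in a length-n sequence. At each scale,
    up to k spans per side are kept, and the span width doubles with distance,
    so each token touches about 2 * k * log2(n / k) spans."""
    spans = []
    # Spans to the right of the query, fine to coarse.
    start, width = q + 1, 1
    while start < n:
        for _ in range(k):
            if start >= n:
                break
            end = min(start + width, n)
            spans.append((start, end))
            start = end
        width *= 2
    # Spans to the left of the query, mirrored.
    end, width = q, 1
    while end > 0:
        for _ in range(k):
            if end <= 0:
                break
            start = max(end - width, 0)
            spans.append((start, end))
            end = start
        width *= 2
    return sorted(spans)


if __name__ == "__main__":
    # For n = 1024 and k = 4, each token attends to a few dozen spans
    # instead of ~1024 individual positions under full self-attention.
    print(len(bpt_context(q=512, n=1024, k=4)))
```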

Cited by 24 publications (34 citation statements) | References 16 publications
“…This sparse graph structure alleviates the complexity from quadratic to linear. Ye et al. (2019) adopted a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), which allows fine-grained spans progressively to attend to coarse-grained spans, leading to a balance between computation complexity and model capacity. Li et al. (2020b) proposed to learn word connections specific to the input via reinforcement learning.…”
Section: Related Work | Citation type: mentioning | Confidence: 99%
“…However, these methods require significant engineering efforts and are hard to train [35]. Other works approach the problem via sparsification of the attention mechanism, such as random or local attention [13,18,31]. Sparsification methods have also been applied successfully for some computer vision tasks [17].…”
Section: Related Work | Citation type: mentioning | Confidence: 99%
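For comparison with the sparsification strategies mentioned in the excerpt above, here is a minimal sketch of a local sliding-window attention mask; the names are illustrative and this is not any cited implementation. It reduces the number of allowed connections per query from n to at most 2 · window + 1, so the total grows linearly in n.

```python
import numpy as np

def local_attention_mask(n, window):
    """Boolean mask: position i may attend only to positions j with
    |i - j| <= window, giving O(n * window) connections instead of
    O(n^2) under full self-attention."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Example: 8 tokens, window of 2 -> at most 5 allowed positions per query.
mask = local_attention_mask(8, 2)
print(mask.sum(axis=1))  # [3 4 5 5 5 5 4 3]
```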
“…Several recent works have proposed strategies to increase the memory capacity of Transformers. BP-Transformer [Ye et al., 2019] is designed to incorporate the common-sense inductive bias of the hierarchical linguistic structure within the sentence, i.e., each query attends to context information from fine-grain to coarse-grain as the relative distance increases. [Rae et al., 2019] uses a pooling operator (e.g., max/mean pooling) to reduce the number of memories in the past, where all memories are equally compressed regardless of the content of the current query.…”
Section: Related Work | Citation type: mentioning | Confidence: 99%
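The pooling-based compression described in the excerpt above, where every past memory is compressed by the same amount regardless of the current query, can be sketched as follows; the function name and the fixed compression rate are assumptions for illustration, not the cited implementation.

```python
import numpy as np

def compress_memories(memories, rate=3):
    """Mean-pool past memory vectors in non-overlapping groups of `rate`;
    every memory is compressed equally, independent of the current query,
    as the excerpt points out."""
    n, d = memories.shape
    trim = n - (n % rate)  # keep any remainder uncompressed at the end
    pooled = memories[:trim].reshape(-1, rate, d).mean(axis=1)
    return np.concatenate([pooled, memories[trim:]], axis=0)

# Example: 10 past memories of dimension 4 shrink to 3 pooled + 1 leftover.
print(compress_memories(np.random.randn(10, 4)).shape)  # (4, 4)
```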
“…Although these approaches have achieved a better speed-memory-accuracy trade-off, they still suffer from the aforementioned limitations of the self-attention mechanism. Another prominent line of work is to increase the memory capacity [Sukhbaatar et al., 2019, Ye et al., 2019, Rae et al., 2019]. However, these works still process information at the same scale.…”
Section: Introduction | Citation type: mentioning | Confidence: 99%