2019
DOI: 10.48550/arxiv.1911.04070
Preprint

BP-Transformer: Modelling Long-Range Context via Binary Partitioning

Zihao Ye,
Qipeng Guo,
Quan Gan
et al.

Abstract: The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application to long text. In this paper, adopting a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), we propose BP-Transformer (BPT for short). BPT yields O(k · n log(n/k)) connections, where k is a hyperparameter that controls the density of attention. BPT strikes a good balance between computation complexity and model capacity. A series…
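The connection count in the abstract follows from how many spans a single token attends to. The following is a minimal sketch in Python, not the authors' implementation; the function name bpt_context and its arguments are illustrative assumptions. For one query position it enumerates k spans per side at each scale, doubling the span width as distance grows, which gives roughly 2k · log2(n/k) spans per token and hence O(k · n log(n/k)) connections overall.

```python
def bpt_context(q, n, k):
    """Illustrative sketch (not the paper's code): list the multi-scale spans
    a query at position q attends to in a length-n sequence. At each scale,
    up to k spans per side are kept, and the span width doubles with distance,
    so each token touches about 2 * k * log2(n / k) spans."""
    spans = []
    # Spans to the right of the query, fine to coarse.
    start, width = q + 1, 1
    while start < n:
        for _ in range(k):
            if start >= n:
                break
            end = min(start + width, n)
            spans.append((start, end))
            start = end
        width *= 2
    # Spans to the left of the query, mirrored.
    end, width = q, 1
    while end > 0:
        for _ in range(k):
            if end <= 0:
                break
            start = max(end - width, 0)
            spans.append((start, end))
            end = start
        width *= 2
    return sorted(spans)


if __name__ == "__main__":
    # For n = 1024 and k = 4, each token attends to a few dozen spans
    # instead of ~1024 individual positions under full self-attention.
    print(len(bpt_context(q=512, n=1024, k=4)))
```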

Cited by 24 publications (34 citation statements) | References 16 publications
“…This sparse graph structure alleviates the complexity from quadratic to linear. Ye et al. (2019) adopted a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), which allows fine-grained spans progressively to attend to coarse-grained spans, leading to a balance between computation complexity and model capacity. Li et al. (2020b) proposed to learn word connections specific to the input via reinforcement learning.…”
Section: Related Work | Citation type: mentioning | Confidence: 99%
“…However, these methods require significant engineering efforts and are hard to train [35]. Other works approach the problem via sparsification of the attention mechanism, such as random or local attention [13,18,31]. Sparsification methods have also been applied successfully for some computer vision tasks [17].…”
Section: Related Work | Citation type: mentioning | Confidence: 99%
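For comparison with the sparsification strategies mentioned in the excerpt above, here is a minimal sketch of a local sliding-window attention mask; the names are illustrative and this is not any cited implementation. It reduces the number of allowed connections per query from n to at most 2 · window + 1, so the total grows linearly in n.

```python
import numpy as np

def local_attention_mask(n, window):
    """Boolean mask: position i may attend only to positions j with
    |i - j| <= window, giving O(n * window) connections instead of
    O(n^2) under full self-attention."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Example: 8 tokens, window of 2 -> at most 5 allowed positions per query.
mask = local_attention_mask(8, 2)
print(mask.sum(axis=1))  # [3 4 5 5 5 5 4 3]
```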
“…Several recent works have proposed strategies to increase the memory capacity of Transformers. BP-Transformer [Ye et al., 2019] is designed to incorporate the common-sense inductive bias of the hierarchical linguistic structure within the sentence, i.e., each query attends to context information from fine-grain to coarse-grain as the relative distance increases. [Rae et al., 2019] uses a pooling operator (e.g., max/mean pooling) to reduce the number of memories in the past, where all memories are equally compressed regardless of the content of the current query.…”
Section: Related Work | Citation type: mentioning | Confidence: 99%
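The pooling-based compression described in the excerpt above, where every past memory is compressed by the same amount regardless of the current query, can be sketched as follows; the function name and the fixed compression rate are assumptions for illustration, not the cited implementation.

```python
import numpy as np

def compress_memories(memories, rate=3):
    """Mean-pool past memory vectors in non-overlapping groups of `rate`;
    every memory is compressed equally, independent of the current query,
    as the excerpt points out."""
    n, d = memories.shape
    trim = n - (n % rate)  # keep any remainder uncompressed at the end
    pooled = memories[:trim].reshape(-1, rate, d).mean(axis=1)
    return np.concatenate([pooled, memories[trim:]], axis=0)

# Example: 10 past memories of dimension 4 shrink to 3 pooled + 1 leftover.
print(compress_memories(np.random.randn(10, 4)).shape)  # (4, 4)
```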
“…Although these approaches have achieved a better speed-memory-accuracy trade-off, they still suffer from the aforementioned limitations of the self-attention mechanism. Another prominent line of work is to increase the memory capacity [Sukhbaatar et al., 2019, Ye et al., 2019, Rae et al., 2019]. However, these works still process information at the same scale.…”
Section: Introduction | Citation type: mentioning | Confidence: 99%