LiGCN: Label-interpretable Graph Convolutional Networks for Multi-label Text Classification

Li, Irene; Feng, Aosong; Wu, Hao; Li, Tianxiao; Suzumura, Toyotaro; Dong, Ruihai

doi:10.18653/v1/2022.dlg4nlp-1.7

Cited by 4 publications

(1 citation statement)

References 27 publications

“…Transformers (Vaswani et al 2017) designed for sequential data have revolutionized the field of Natural Language Processing (NLP) (Liu et al 2019; Zhu et al 2020; Li et al 2022), and have recently made tremendous impact in graph learning (Yang et al 2021; Dwivedi and Bresson 2020) and computer vision (Dosovitskiy et al 2020; Huynh 2022). The self-attention used by regular Transformer models comes with a quadratic time and memory complexity O(n^2) for input sequence of length n, which prevents the application of Transformers to longer sequences in practical settings with limited computational resources.…”

Section: Introduction (mentioning)

confidence: 99%

Diffuser: Efficient Transformers with Multi-Hop Attention Diffusion for Long Sequences

Feng, Jiang, et al. 2023, AAAI

Efficient Transformers have been developed for long sequence modeling, due to their subquadratic memory and time complexity. Sparse Transformer is a popular approach to improving the efficiency of Transformers by restricting self-attention to locations specified by the predefined sparse patterns. However, leveraging sparsity may sacrifice expressiveness compared to full-attention, when important token correlations are multiple hops away. To combine advantages of both the efficiency of sparse transformer and the expressiveness of full-attention Transformer, we propose Diffuser, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory costs. The key idea is to expand the receptive field of sparse attention using Attention Diffusion, which computes multi-hop token correlations based on all paths between corresponding disconnected tokens, besides attention among neighboring tokens. Theoretically, we show the expressiveness of Diffuser as a universal sequence approximator for sequence-to-sequence modeling, and investigate its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective. Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and Long Range Arena (LRA). Evaluation results show that Diffuser achieves improvements by an average of 0.94% on text classification tasks and 2.30% on LRA, with 1.67x memory savings compared to state-of-the-art benchmarks, which demonstrates superior performance of Diffuser in both expressiveness and efficiency aspects.
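The multi-hop idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the sliding-window pattern, the function names, the `hops` and `alpha` parameters, and the PageRank-style recursion are all assumptions chosen for illustration. The attention matrix is also stored densely here for readability, whereas an efficient version would keep only the sparse entries.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask: token i may attend to token j only if |i - j| <= w (assumed pattern)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def sparse_attention(q, k, mask):
    """Row-normalized softmax attention restricted to the positions allowed by mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # masked positions get zero weight
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def diffuse(attn, v, hops=3, alpha=0.1):
    """Assumed diffusion rule: Z <- (1 - alpha) * A @ Z + alpha * V, repeated `hops` times.

    Each step propagates values one more hop along the sparse pattern, so after
    `hops` steps a token can receive information from tokens up to `hops` edges away.
    """
    z = v
    for _ in range(hops):
        z = (1 - alpha) * attn @ z + alpha * v
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 4
    q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
    attn = sparse_attention(q, k, sliding_window_mask(n, w=1))
    out = diffuse(attn, v, hops=3, alpha=0.1)
    print(out.shape)
```

Passing an identity matrix as `v` returns the effective token-to-token weights, which makes the enlarged receptive field visible: with a window of 1 and 3 hops, token 0 reaches token 3 but not token 4.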
