Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.333
Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

Abstract: In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently-proposed dynamic lightweight convolutions, and we consider multiple new ways of integrating convolutions into Transformer self-attention. Specifically, we propose composite attention, which unites previous relative position embedding methods under a convolutional framework. We conduct experiments by training BE…
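A minimal sketch of the equivalence the abstract describes, assuming PyTorch; the tensor names and the window half-width K are illustrative assumptions, not the authors' released code. Adding a Shaw-style relative-position term to the attention logits and then dropping the content (query-key) term leaves, for each query position, a normalized weight profile over relative offsets, which can be read as a dynamically generated lightweight convolution kernel.

```python
# Sketch only: Shaw-style relative-position self-attention vs. its
# convolutional reading. Shapes and names are illustrative assumptions.
import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 8, 16   # batch, heads, sequence length, head dim
K = 3                      # assumed max relative distance (window half-width)

q = torch.randn(B, H, T, D)          # query vectors
k = torch.randn(B, H, T, D)          # key vectors
rel_emb = torch.randn(2 * K + 1, D)  # one embedding per clamped offset (j - i)

# Relative offsets j - i, clamped to [-K, K] and shifted to index rel_emb.
offsets = torch.arange(T)[None, :] - torch.arange(T)[:, None]   # (T, T)
idx = offsets.clamp(-K, K) + K

content_logits = torch.einsum("bhqd,bhkd->bhqk", q, k)              # token-token
position_logits = torch.einsum("bhqd,qkd->bhqk", q, rel_emb[idx])   # token-position

# Relative-position self-attention mixes both terms.
attn = F.softmax((content_logits + position_logits) / D ** 0.5, dim=-1)

# Dropping the content term leaves, for each query, a normalized kernel over
# relative offsets -- i.e. a dynamically generated lightweight convolution.
conv_like = F.softmax(position_logits / D ** 0.5, dim=-1)
print(attn.shape, conv_like.shape)   # both (B, H, T, T)
```

The composite attention proposed in the paper unites such relative position embedding terms under a convolutional framework; the sketch only illustrates the convolutional reading of the relative-position term.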

Cited by 9 publications (7 citation statements: 0 supporting, 7 mentioning, 0 contrasting); References 16 publications.
“…Relative attention uses the relative position representations of Shaw et al. (2018)'s approach but without the query-key dot product of multi-head attention. Thus, relative attention differs from rPosNet in that content information is provided to the queries, and it is equivalent to dynamic convolutions with global context (Chang et al., 2021). In total, the x-axis of Figure 2 depicts position-position interactions for aPosNet and rPosNet, token-position interactions for relative attention, token-token interactions for the Transformer, and token-token + token-position interactions for Shaw et al. (2018).…”
Section: The Impact of Gating and Query-Key Information (mentioning)
confidence: 99%
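A rough sketch of the interaction types contrasted in this passage, for a single head and sequence. The tensor names, the position-query stand-in p, and the way aPosNet/rPosNet are represented are illustrative assumptions, not the cited papers' exact parameterizations.

```python
# Sketch only: the logit terms named on the Figure 2 x-axis of the citing paper.
import torch

T, D = 6, 8
q = torch.randn(T, D)        # content queries (single head, single sequence)
k = torch.randn(T, D)        # content keys
p = torch.randn(T, D)        # stand-in position queries (schematic only)
r = torch.randn(T, T, D)     # r[i, j]: embedding of the relative offset j - i

token_token = q @ k.T                              # vanilla Transformer
token_pos   = torch.einsum("id,ijd->ij", q, r)     # "relative attention"
pos_pos     = torch.einsum("id,ijd->ij", p, r)     # aPosNet / rPosNet (schematic)
shaw        = token_token + token_pos              # Shaw et al. (2018)

# Softmaxing the token-position term alone gives per-query weights over
# relative offsets, the form the quoted passage equates with dynamic
# convolutions with global context (Chang et al., 2021).
weights = torch.softmax(token_pos, dim=-1)
print(weights.shape)   # (T, T)
```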
“…In the context of Transformers, countless works proposed ways to include some form of position-based attention bias (Shaw et al., 2018; Yang et al., 2018; Dai et al., 2019; Wang et al., 2020; Ke et al., 2021; Su et al., 2021; Luo et al., 2021; Qu et al., 2021; Chang et al., 2021; Wu et al., 2021; Wennberg & Henter, 2021; Likhomanenko et al., 2021; Dufter et al., 2022; Luo et al., 2022; Sun et al., 2022), inter alia. Dynamic convolution (Wu et al., 2019) and other similar models can also be related to this family; see Table 9.…”
Section: F. Connections to Other Models (mentioning)
confidence: 99%
“…Recently, Pre-trained Language Models (PLMs) have significantly improved various downstream NLP tasks (He et al. 2020; Xu et al. 2021; Chang et al. 2021). In PLMs, the two-stage strategy (i.e., pre-training and fine-tuning) (Devlin et al. 2019) inherits the knowledge learned during pre-training and applies it to downstream tasks.…”
Section: Introduction (mentioning)
confidence: 99%