Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.379
Query-Key Normalization for Transformers

Abstract: Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work adapting the Transformer's normalization to this setting, we propose QKNORM, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity. Specifically, we apply ℓ2 normalization along the head dimension of each query and key matrix prior to multiplying them and then scale up by a learnable parameter instead…
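To make the mechanism concrete, below is a minimal PyTorch sketch of the attention step as the abstract describes it: queries and keys are ℓ2-normalized along the head dimension, and the dot product is scaled up by a learnable parameter rather than divided by √dk. This is an illustrative sketch, not the authors' released code; the function name, tensor shapes, and the initial value of the learnable scale g are assumptions, and masking and dropout are omitted.

```python
import torch
import torch.nn.functional as F

def qknorm_attention(q, k, v, g):
    """Illustrative QKNorm-style attention.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim)
    g: learnable scalar that replaces division by sqrt(head_dim)
    """
    # L2-normalize queries and keys along the head dimension, so each
    # query-key dot product becomes a cosine similarity in [-1, 1].
    q = F.normalize(q, p=2, dim=-1)
    k = F.normalize(k, p=2, dim=-1)

    # Scale up by the learnable parameter g instead of dividing by sqrt(d_k).
    scores = g * torch.matmul(q, k.transpose(-2, -1))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Hypothetical usage with assumed shapes
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)
g = torch.nn.Parameter(torch.tensor(10.0))  # initial value is an assumption
out = qknorm_attention(q, k, v, g)  # shape (2, 8, 16, 64)
```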

Cited by 13 publications (18 citation statements) | References 25 publications
“…In this RQ, we want to investigate the impact of four different normalization methods on automatic shellcode generation and summarization tasks. In particular, we consider the traditional normalization method in Transformer (i.e., PostNorm [70]), two state-of-the-art normalization methods (i.e., PreNorm [53] and QKNorm [24]), and our proposed normalization method Adjust QKNorm. The details of these normalization methods are illustrated as follows.…”
Section: Results Analysis for RQ4
confidence: 99%
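For context on the PostNorm and PreNorm baselines named in this citation (QKNorm instead acts inside the attention computation, as sketched after the abstract above), here is a minimal sketch of the residual/LayerNorm ordering difference, assuming a generic sublayer module; the class names are illustrative and not from the cited papers.

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Original Transformer ordering: LayerNorm after the residual addition."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """PreNorm ordering: LayerNorm applied to the sublayer input."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```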
“…With the advances of the Transformer, researchers conducted various architectural and functional modifications to improve its performance on low-resource tasks, including reducing the depth of the model, adding a regularization penalty, or adjusting the order of the LayerNorm [50], [53]-[56]. The method QKNorm proposed by Henry et al. [24] can achieve promising results on low-resource machine translation tasks. Specifically, QKNorm first performs L2 normalization of Q and K before the dot product, at which point the dot-product result can be expressed as a cosine-similarity calculation between Q and K. Thus the calculation results are bounded to the interval [-1, 1] and do not need to be divided by √dk.…”
Section: A Framework of DualSC
confidence: 99%
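As a compact restatement of the mechanism this citation describes (notation is mine, not quoted from either paper):

```latex
% Q_i and K_j denote rows of the query and key matrices; g is learnable.
\[
\hat{Q}_i = \frac{Q_i}{\lVert Q_i \rVert_2}, \qquad
\hat{K}_j = \frac{K_j}{\lVert K_j \rVert_2}, \qquad
\mathrm{QKNormAttn}(Q, K, V) = \mathrm{softmax}\!\left(g\,\hat{Q}\hat{K}^{\top}\right)V
\]
```

Each entry of the normalized product is a cosine similarity, hence bounded to [-1, 1], which is why no division by √dk is needed.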
“…However, they focused on parametrizing the temperature of timestep t using the activations from timestep t−1. Contemporary to this work, Henry et al. (2020) proposed query-key normalization in Transformers. There is a range of work trying to combine attention with convolution (Yin and Schütze, 2018; Yu et al., 2018).…”
Section: Data
confidence: 99%