Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.379
Query-Key Normalization for Transformers

Abstract: Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work adapting the Transformer's normalization to this setting, we propose QKNORM, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity. Specifically, we apply ℓ2 normalization along the head dimension of each query and key matrix prior to multiplying them and then scale up by a learnable parameter instead…
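To make the mechanism concrete, below is a minimal PyTorch sketch of the attention step as the abstract describes it: queries and keys are ℓ2-normalized along the head dimension, and the dot product is scaled up by a learnable parameter rather than divided by √dk. This is an illustrative sketch, not the authors' released code; the function name, tensor shapes, and the initial value of the learnable scale g are assumptions, and masking and dropout are omitted.

```python
import torch
import torch.nn.functional as F

def qknorm_attention(q, k, v, g):
    """Illustrative QKNorm-style attention.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim)
    g: learnable scalar that replaces division by sqrt(head_dim)
    """
    # L2-normalize queries and keys along the head dimension, so each
    # query-key dot product becomes a cosine similarity in [-1, 1].
    q = F.normalize(q, p=2, dim=-1)
    k = F.normalize(k, p=2, dim=-1)

    # Scale up by the learnable parameter g instead of dividing by sqrt(d_k).
    scores = g * torch.matmul(q, k.transpose(-2, -1))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Hypothetical usage with assumed shapes
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)
g = torch.nn.Parameter(torch.tensor(10.0))  # initial value is an assumption
out = qknorm_attention(q, k, v, g)  # shape (2, 8, 16, 64)
```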

Cited by 13 publications (18 citation statements) | References 25 publications
“…In this RQ, we want to investigate the impact of four different normalization methods on automatic shellcode generation and summarization tasks. In particular, we consider the traditional normalization method in Transformer (i.e., PostNorm [70]), two state-of-the-art normalization methods (i.e., PreNorm [53] and QKNorm [24]), and our proposed normalization method Adjust QKNorm. The details of these normalization methods are illustrated as follows.…”
Section: Results Analysis for RQ4
confidence: 99%
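For context on the PostNorm and PreNorm baselines named in this citation (QKNorm instead acts inside the attention computation, as sketched after the abstract above), here is a minimal sketch of the residual/LayerNorm ordering difference, assuming a generic sublayer module; the class names are illustrative and not from the cited papers.

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Original Transformer ordering: LayerNorm after the residual addition."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """PreNorm ordering: LayerNorm applied to the sublayer input."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```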
“…With the advances of the Transformer, researchers conducted various architectural and functional modifications to improve its performance on low-resource tasks, including reducing the depth of the model, adding a regularization penalty, or adjusting the order of the LayerNorm [50], [53]-[56]. The method QKNorm proposed by Henry et al. [24] can achieve promising results on low-resource machine translation tasks. Specifically, QKNorm first performs L2 normalization of Q and K before the dot product, at which point the dot-product result can be expressed as a cosine-similarity calculation between Q and K. Thus the calculation results are bounded to the interval [-1, 1] and do not need to be divided by √dk.…”
Section: A Framework of DualSC
confidence: 99%
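As a compact restatement of the mechanism this citation describes (notation is mine, not quoted from either paper):

```latex
% Q_i and K_j denote rows of the query and key matrices; g is learnable.
\[
\hat{Q}_i = \frac{Q_i}{\lVert Q_i \rVert_2}, \qquad
\hat{K}_j = \frac{K_j}{\lVert K_j \rVert_2}, \qquad
\mathrm{QKNormAttn}(Q, K, V) = \mathrm{softmax}\!\left(g\,\hat{Q}\hat{K}^{\top}\right)V
\]
```

Each entry of the normalized product is a cosine similarity, hence bounded to [-1, 1], which is why no division by √dk is needed.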
“…However, they focused on parametrizing the temperature of timestep t using the activations from timestep t−1. Contemporary to this work, Henry et al. (2020) proposed query-key normalization in Transformers. There is a range of work trying to combine attention with convolution (Yin and Schütze, 2018; Yu et al., 2018).…”
Section: Data
confidence: 99%