Abstract: Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work adapting the Transformer's normalization to this setting, we propose QKNORM, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity. Specifically, we apply ℓ2 normalization along the head dimension of each query and key matrix prior to multiplying them and then scale up by a learnable parameter instead…
“…To prevent the attention operation from overflowing, we adopted QKNorm (Henry et al, 2020), L2 normalization of queries and keys before the dot product. We split the subsequences into training, validation, and testing datasets in an 8:1:1 ratio.…”
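The operation described in the abstract and in the citing statement above can be sketched for a single attention head as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names and the scale argument `g` are hypothetical, and `g` is passed in as a fixed number here, whereas the abstract describes it as a learnable parameter. Because both vectors have unit length after normalization, each query-key dot product is a cosine similarity, so every logit lies in the range [-g, g] regardless of how large the raw queries and keys are.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Rescale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def softmax(logits):
    """Softmax over a list of numbers, shifted by the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def qk_norm_logits_and_weights(queries, keys, g):
    """Attention logits and weights for one head with normalized queries and keys.

    queries, keys: lists of equal-length vectors (one vector per position).
    g: scale applied to the normalized dot products. Assumption: a fixed float
       here for illustration; the abstract describes a learnable parameter.
    """
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]

    logits = []
    weights = []
    for q in q_hat:
        # Dot product of unit vectors is a cosine similarity in [-1, 1],
        # so each logit is bounded by g in absolute value.
        row = [g * sum(a * b for a, b in zip(q, k)) for k in k_hat]
        logits.append(row)
        weights.append(softmax(row))
    return logits, weights


if __name__ == "__main__":
    # Hypothetical toy inputs with very different magnitudes.
    queries = [[1.0, 2.0, 3.0], [1e6, -2e6, 5e5]]
    keys = [[0.5, 0.1, -0.3], [3e5, 1e5, 2e5]]
    logits, weights = qk_norm_logits_and_weights(queries, keys, g=10.0)
    for row_logits, row_weights in zip(logits, weights):
        print(row_logits, row_weights)
```

In this sketch, multiplying a query or key by any positive constant leaves the attention weights unchanged, which is one way to read the citing statement's remark about preventing the attention operation from overflowing.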