Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.3
Rare Tokens Degenerate All Tokens: Improving Neural Text Generation via Adaptive Gradient Gating for Rare Token Embeddings

Abstract: Recent studies have determined that the learned token embeddings of large-scale neural language models degenerate into an anisotropic, narrow-cone shape. This phenomenon, called the representation degeneration problem, increases the overall similarity between token embeddings, which negatively affects model performance. Although existing methods that address the degeneration problem, based on observations of the phenomenon it triggers, improve the performance …
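The degeneration the abstract describes is typically quantified by how similar token embeddings are to one another on average. The following is a minimal illustrative sketch of one such measurement, the mean pairwise cosine similarity of an embedding matrix; the function name and toy data are assumptions for illustration only and do not reproduce the paper's AGG implementation.

import numpy as np

def average_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of token embeddings."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)   # unit-normalize each row
    sims = unit @ unit.T                              # (V, V) cosine-similarity matrix
    v = sims.shape[0]
    return (sims.sum() - np.trace(sims)) / (v * (v - 1))  # exclude self-similarity

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(1000, 64))               # directions spread over the space
narrow_cone = rng.normal(size=(1000, 64)) * 0.1 + 1.0  # all vectors share a dominant direction
print(average_pairwise_cosine(isotropic))    # near 0.0
print(average_pairwise_cosine(narrow_cone))  # near 1.0, the degenerated, narrow-cone case

In this toy setup, isotropic random embeddings score close to zero, while embeddings squeezed into a narrow cone score close to one, matching the qualitative picture the abstract gives of the degeneration problem.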

Cited by 6 publications (1 citation statement) · References 21 publications
“…Lastly, we were only able to evaluate a limited number of word embedding algorithms that account for token frequency issues. Potential alternatives include KAFE (Ashfaq et al, 2022), which relies on a knowledge graph to improve token representations, and AGG (Yu et al, 2022), for which the code was not available at the time of conducting the experiments. Similarly, we chose to fine-tune our BERT model for four epochs in all cases to obtain a comparable setting.…”
Section: Limitations (mentioning, confidence: 99%)