2021
DOI: 10.48550/arxiv.2110.02782
Preprint

How BPE Affects Memorization in Transformers

Abstract: Training data memorization in NLP can be both beneficial (e.g., closed-book QA) and undesirable (personal data extraction). In any case, successful model training requires a non-trivial amount of memorization to store word spellings, various linguistic idiosyncrasies and common knowledge. However, little is known about what affects the memorization behavior of NLP models, as the field tends to focus on the equally important question of generalization. In this work, we demonstrate that the size of the subword vocabulary…
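
The abstract's central variable, the size of the BPE subword vocabulary, is easy to reproduce in miniature. Below is a minimal sketch (not the paper's code) that trains BPE tokenizers at several vocabulary sizes and compares how they segment the same sentence; it assumes the HuggingFace `tokenizers` package, a hypothetical local file corpus.txt, and illustrative vocabulary sizes.

# Sketch: how BPE vocabulary size changes tokenization.
# Assumes the HuggingFace `tokenizers` package and a local corpus.txt.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

SAMPLE = "Byte pair encoding merges frequent symbol pairs into subwords."

for vocab_size in (500, 4000, 32000):  # illustrative sizes, not the paper's
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus
    tokens = tokenizer.encode(SAMPLE).tokens
    # Larger vocabularies produce fewer, longer subword tokens per sentence.
    print(f"vocab={vocab_size}: {len(tokens)} tokens -> {tokens}")

Shorter token sequences at larger vocabularies are one plausible mechanism linking vocabulary size to memorization, which is the kind of effect the paper measures.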

Cited by 4 publications (5 citation statements)
References 33 publications
“…Recent studies show that large language models can memorize their training data and generate texts from the training data given certain prompts (Kharitonov et al., 2021; Thakkar et al., 2020; Carlini et al., 2019; Tirumala et al., 2022). Most related to our work, Carlini et al. (2022) found that the memorization ability of LLMs grows significantly with model capacity, the number of times an example has been duplicated, and the number of tokens of context used to prompt the model.…”
Section: Memorization in Large Language Models (supporting)
confidence: 55%
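
The Carlini et al. (2022) result summarized above is based on a prefix-prompting test: a training example counts as extractable if, given its first k tokens, greedy decoding reproduces the next n tokens verbatim. A minimal sketch of that check, assuming the HuggingFace `transformers` API, with gpt2 as a stand-in model and k = n = 50 as illustrative values:

# Sketch of a prefix-prompting memorization check in the style of
# Carlini et al. (2022). Model choice and k/n values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def is_extractable(text: str, k: int = 50, n: int = 50) -> bool:
    """True if the n tokens after a k-token prefix are reproduced greedily."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    if ids.shape[0] < k + n:
        return False
    prefix, target = ids[:k], ids[k:k + n]
    with torch.no_grad():
        out = model.generate(prefix.unsqueeze(0), max_new_tokens=n,
                             do_sample=False, pad_token_id=tok.eos_token_id)
    # The generated ids include the prefix; compare only the continuation.
    return torch.equal(out[0, k:k + n], target)

Sweeping k upward, or running the check on duplicated training examples, is how the cited dependence on context length and duplication is measured.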
“…Recent work has argued that memorization is not exclusively harmful and can be crucial for certain types of generalization (e.g., on QA tasks) [19,20,21], while also allowing models to encode significant amounts of world or factual knowledge [22,23,24]. There is also a growing body of work analyzing fundamental properties of memorization in language models [9,8,10]. Most related to our work, [8] analyzes memorization of fully trained language models and observes a dependence on model scale, training data duplication, and prompting context length.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…However, perhaps surprisingly, relatively little work has been done to understand the impact of scale on the dynamics of language model memorization over training. Existing work focuses on analyzing memorization post-training [8,9,10,11]. In this work, we study the memorization and forgetting dynamics in language models, with a focus on better measuring how they change as we scale up model size.…”
Section: Introduction (mentioning)
confidence: 99%
“…In comparisons with the baselines (Transformer and VHRED), generally (i) transformer-based LMs outperform VHRED due to their attention mechanism, which explicitly encodes sequential semantic information, and (ii) the MoE-LMs achieve substantially better diversity without sacrificing much accuracy (i.e., the perplexity scores are still quite low). Qualitatively, the sample utterances generated by the Transformer are closer to the targets than those by MoE-2 and MoE-4, likely because the Transformer tends to memorize the corpus [Kharitonov et al., 2021]. In contrast, MoE-LMs generate utterances that share contexts with the targets but are paraphrased, or have similar structures but different contexts, demonstrating their generalizability.…”
Section: Methods (mentioning)
confidence: 85%
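
For the perplexity figures this snippet refers to, the standard computation exponentiates a causal LM's mean token-level cross-entropy. A short sketch, assuming the HuggingFace `transformers` API and gpt2 as a stand-in model (the cited paper's own evaluation code is not shown here):

# Sketch: perplexity as exp(mean cross-entropy) under a causal LM.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids makes the model return the shifted LM loss,
        # i.e., mean negative log-likelihood per predicted token.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("Lower perplexity means the model finds the text more predictable."))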