2021
DOI: 10.48550/arxiv.2110.02782
Preprint

How BPE Affects Memorization in Transformers

Abstract: Training data memorization in NLP can be both beneficial (e.g., closed-book QA) and undesirable (personal data extraction). In any case, successful model training requires a non-trivial amount of memorization to store word spellings, various linguistic idiosyncrasies and common knowledge. However, little is known about what affects the memorization behavior of NLP models, as the field tends to focus on the equally important question of generalization. In this work, we demonstrate that the size of the subword vocabulary…
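
The abstract's central variable, the size of the BPE subword vocabulary, is easy to reproduce in miniature. Below is a minimal sketch (not the paper's code) that trains BPE tokenizers at several vocabulary sizes and compares how they segment the same sentence; it assumes the HuggingFace `tokenizers` package, a hypothetical local file corpus.txt, and illustrative vocabulary sizes.

# Sketch: how BPE vocabulary size changes tokenization.
# Assumes the HuggingFace `tokenizers` package and a local corpus.txt.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

SAMPLE = "Byte pair encoding merges frequent symbol pairs into subwords."

for vocab_size in (500, 4000, 32000):  # illustrative sizes, not the paper's
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus
    tokens = tokenizer.encode(SAMPLE).tokens
    # Larger vocabularies produce fewer, longer subword tokens per sentence.
    print(f"vocab={vocab_size}: {len(tokens)} tokens -> {tokens}")

Shorter token sequences at larger vocabularies are one plausible mechanism linking vocabulary size to memorization, which is the kind of effect the paper measures.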

Cited by 4 publications (5 citation statements)
References 33 publications
“…Recent studies show that large language models can memorize their training data and generate texts from the training data given certain prompts (Kharitonov et al., 2021; Thakkar et al., 2020; Carlini et al., 2019; Tirumala et al., 2022). Most related to our work, Carlini et al. (2022) found that the memorization ability of LLMs grows significantly with model capacity, the number of times an example has been duplicated, and the number of tokens of context used to prompt the model.…”
Section: Memorization in Large Language Models (supporting)
confidence: 55%
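
The Carlini et al. (2022) result summarized above is based on a prefix-prompting test: a training example counts as extractable if, given its first k tokens, greedy decoding reproduces the next n tokens verbatim. A minimal sketch of that check, assuming the HuggingFace `transformers` API, with gpt2 as a stand-in model and k = n = 50 as illustrative values:

# Sketch of a prefix-prompting memorization check in the style of
# Carlini et al. (2022). Model choice and k/n values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def is_extractable(text: str, k: int = 50, n: int = 50) -> bool:
    """True if the n tokens after a k-token prefix are reproduced greedily."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    if ids.shape[0] < k + n:
        return False
    prefix, target = ids[:k], ids[k:k + n]
    with torch.no_grad():
        out = model.generate(prefix.unsqueeze(0), max_new_tokens=n,
                             do_sample=False, pad_token_id=tok.eos_token_id)
    # The generated ids include the prefix; compare only the continuation.
    return torch.equal(out[0, k:k + n], target)

Sweeping k upward, or running the check on duplicated training examples, is how the cited dependence on context length and duplication is measured.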
“…Recent work has argued that memorization is not exclusively harmful and can be crucial for certain types of generalization (e.g., on QA tasks) [19,20,21], while also allowing models to encode significant amounts of world or factual knowledge [22,23,24]. There is also a growing body of work analyzing fundamental properties of memorization in language models [9,8,10]. Most related to our work, [8] analyzes memorization of fully trained language models and observes a dependence on model scale, training data duplication, and prompting context length.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…However, perhaps surprisingly, relatively little work has been done to understand the impact of scale on the dynamics of language model memorization over training. Existing work focuses on analyzing memorization post-training [8,9,10,11]. In this work, we study the memorization and forgetting dynamics in language models, with a focus on better measuring how they change as we scale up model size.…”
Section: Introduction (mentioning)
confidence: 99%
“…In comparisons with the baselines (Transformer and VHRED), generally (i) transformer-based LMs outperform VHRED due to their attention mechanism, which explicitly encodes sequential semantic information, and (ii) the MoE-LMs achieve substantially better diversity without sacrificing much accuracy (i.e., the perplexity scores are still quite low). Qualitatively, the sample utterances generated by the Transformer are closer to the targets than those by MoE-2 and MoE-4, likely because the Transformer tends to memorize the corpus [Kharitonov et al., 2021]. In contrast, MoE-LMs generate utterances that share contexts with the targets but are paraphrased, or have similar structures but different contexts, demonstrating their generalizability.…”
Section: Methods (mentioning)
confidence: 85%
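
For the perplexity figures this snippet refers to, the standard computation exponentiates a causal LM's mean token-level cross-entropy. A short sketch, assuming the HuggingFace `transformers` API and gpt2 as a stand-in model (the cited paper's own evaluation code is not shown here):

# Sketch: perplexity as exp(mean cross-entropy) under a causal LM.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids makes the model return the shifted LM loss,
        # i.e., mean negative log-likelihood per predicted token.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("Lower perplexity means the model finds the text more predictable."))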