2022
DOI: 10.48550/arxiv.2207.14255
Preprint

Efficient Training of Language Models to Fill in the Middle

Abstract: We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide…
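For reference, the transformation the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration only, assuming character-level split points chosen uniformly at random and illustrative sentinel strings (<PRE>, <SUF>, <MID>); the paper's actual sentinel tokens, span-selection scheme, and tokenization may differ.

import random

def fim_transform(document: str, rng: random.Random) -> str:
    # Split the document into (prefix, middle, suffix) at two random cut points,
    # then move the middle span to the end so a left-to-right model learns to
    # generate it after seeing both the prefix and the suffix.
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # Sentinel strings here are illustrative placeholders, not the paper's exact tokens.
    return "<PRE>" + prefix + "<SUF>" + suffix + "<MID>" + middle

# Example usage: transform one training document.
rng = random.Random(0)
print(fim_transform("def add(a, b):\n    return a + b\n", rng))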

Cited by 19 publications (20 citation statements)
References 24 publications

“…There is a strong correlation between the model parameter count and accuracy, 25 so we focus only on the largest models with more than 1B parameters. The architectures of the models are all decoder-only like GPT-3, with the ability to insert completions, 26 (except when noted). The first model is a GPT-3 12B fine-tuned on code (Codex), abbreviated as "cushman".…”
Section: Methods (mentioning)
confidence: 99%
“…There is a strong correlation between model parameter count and accuracy, 24 so we focus only on the largest models with more than 1B parameters. The architectures of models are all decoder-only like GPT-3 3 with the ability to insert completions, 25 (except when noted). The first model is a GPT-3 12B fine-tuned on code (Codex) abbreviated as "cushman."…”
Section: Methods (mentioning)
confidence: 99%
“…Other approaches to using DNNs for interpolation involve using the DNN to learn a probabilistic model of the data [5], and generate the interpolated values using the learned data distribution (with applications to natural language processing and time series analysis). In NLP, the ability of language models to learn to infill (missing parts of) text [2] can also be considered close to an extrapolation method.…”
Section: Interpolation By Deep Neural Network (mentioning)
confidence: 99%