What Makes Good In-Context Examples for GPT-3?

Liu, Jiachang; Shen, Dinghan; Zhang, Yizhe; Dolan, Bill; Carin, Lawrence; Chen, Weizhu

doi:10.18653/v1/2022.deelio-1.10

Cited by 255 publications

(211 citation statements)

References 24 publications

Citation statement types: Supporting, Mentioning (204), Contrasting


“…More recently, prompting or masked infilling has been used to probe models for knowledge (Petroni et al, 2019) or perform a variety of NLP tasks (Radford et al, 2019; Brown et al, 2020). There has also been work on eliciting prompting behavior in smaller models (Schick and Schütze, 2020; Gao et al, 2021b; Li and Liang, 2021; Lester et al, 2021; Scao and Rush, 2021), improving the flexibility of prompting (Shin et al, 2020), and understanding why and how prompting works (Liu et al, 2021; Min et al, 2022).…”

Section: Limitations (mentioning)

confidence: 99%

OPT: Open Pre-trained Transformer Language Models

Zhang, Roller, Goyal et al. 2022

Preprint


Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3,¹ while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

* Equal contribution. † Work done while at Meta AI.

¹ Following Brown et al. (2020), we use GPT-3 to refer to both the 175B model and the smaller scale models as well.

² Exceptions include work by EleutherAI, who released dense models up to 20B in size (Black et al., 2022), Salesforce (Nijkamp et al., 2022), and Meta AI, who released dense models up to 13B and sparse models up to 1.1T (Artetxe et al., 2021). There is also ongoing work from the BigScience workshop (https://bigscience.huggingface.co/), which aims to open source very large multilingual language models and datasets.


“…To get the demonstrations, GPT-3 uses a random strategy to draw from D_train. Gao et al (2021) and Liu et al (2021a) use a BERT (Devlin et al, 2019)/RoBERTa (Liu et al, 2019) based retriever to sample examples with similarity. Their experiment… [Footnote 1: D_train represents the training dataset for classification problems in their paper.]”

Section: Demonstration With Similarity (mentioning)

confidence: 99%

“…This method is first used by GPT series (Radford et al, 2019;Brown et al, 2020). While in GPT models the demonstrations are selected randomly, researchers found that the selection with similarity would significantly improve the final performance (Gao et al, 2021;Liu et al, 2021a). Liu et al (2021a) and Kumar and Talukdar (2021) also discovered that the order of prompts provided to the model has a great influence on the performance of the model.…”

Section: Generality Test (mentioning)

confidence: 99%

“…The second stage is to sample demonstrations. Gao et al (2021) and Liu et al (2021a) simply sample them by the static cosine similarity based on models trained on related datasets (this model is regarded as a retriever). This approach is suboptimal as: (1) it doesn't make use of the training data; (2) in cases of a low-resource language, we do not have sufficient labeled data to train such a retriever.…”

Section: Introduction (mentioning)

confidence: 99%
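
The contrast drawn in the statements above, random demonstration selection versus selection by similarity to the test input, can be sketched in a few lines. This is only an illustration under stated assumptions: the bag-of-words `embed` function is a toy stand-in for the BERT/RoBERTa-based retrievers the statements mention, and `D_TRAIN`, `select_random`, `select_similar`, and `build_prompt` are hypothetical names, not taken from any of the cited papers.

```python
import math
import random
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a sentence encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_random(train, k, seed=0):
    """Baseline: draw k demonstrations uniformly at random."""
    return random.Random(seed).sample(train, k)

def select_similar(query, train, k):
    """Retrieve the k training examples whose inputs are closest to the query."""
    q = embed(query)
    ranked = sorted(train, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]

def build_prompt(demos, query):
    """Concatenate demonstrations, then the unlabeled query (hypothetical template)."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)

# Hypothetical labeled pool playing the role of the training set.
D_TRAIN = [
    ("the movie was wonderful and moving", "positive"),
    ("the food was cold and bland", "negative"),
    ("the movie was dull and slow", "negative"),
    ("the hotel staff were friendly", "positive"),
]

query = "the movie was dull"
demos = select_similar(query, D_TRAIN, k=2)
prompt = build_prompt(demos, query)
print(prompt)
```

The retrieved demonstrations are returned most-similar first; since the statements also report that prompt order affects performance, the ordering is a separate choice to tune rather than something this sketch settles.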


Few-Shot Natural Language Inference Generation with PDD: Prompt and Dynamic Demonstration

Li, Gong, Zhu 2022

Preprint


The Natural Language Inference Generation task is to generate a text hypothesis given a text premise and a logical relation between the two. This task can be used in data augmentation and controllable text generation in practice. In this paper, we propose language models with prompt and dynamic demonstration (LM-PDD) to tackle this problem in few-shot settings. Our framework outperforms standard fine-tuned models in low-resource settings, achieving an average 8% absolute improvement on the SNLI and MNLI datasets, and the results on 13 natural language classification tasks also show that our dynamic demonstration method has good generalizability.


“…Neural vs. Discrete Demonstration Compared with prior discrete demonstrations described in [12,33,47,26], retrieving weighted neural demonstrations from the knowledge-store to augment prompt learning has advantages in the following three major aspects: (1) neural demonstrations could be more tolerant of the model's maximum input length than discrete demonstrations, while the discrete demonstration is usually not suitable for multi-class classification tasks due to the limitation of input length, such as relation extraction, etc. (2) the model needs to deal with large retrieval tokens for discrete demonstration, making it time-consuming and computationally intensive to perform cross-attention operations due to the quadratic attention complexity.…”

Section: Retrieval Of Neural Demonstration (mentioning)

confidence: 99%
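
The point about quadratic attention complexity in the statement above can be made concrete with a little arithmetic. The sketch below only counts the query-key pairs that full self-attention scores over a prompt; the token counts (50 tokens per demonstration and per query) are hypothetical, and constant factors, layers, and heads are ignored.

```python
def prompt_length(num_demos, demo_len, query_len):
    """Total tokens when discrete demonstrations are concatenated before the query."""
    return num_demos * demo_len + query_len

def attention_pairs(seq_len):
    """Query-key pairs scored by full self-attention over seq_len tokens."""
    return seq_len * seq_len

QUERY_LEN = 50  # hypothetical token count for the test input
DEMO_LEN = 50   # hypothetical token count per demonstration

for k in (0, 4, 16):
    n = prompt_length(k, DEMO_LEN, QUERY_LEN)
    ratio = attention_pairs(n) // attention_pairs(QUERY_LEN)
    print(f"{k:>2} demonstrations -> {n:>3} tokens, {ratio}x the attention pairs")
```

Under these assumed lengths, going from 0 to 4 demonstrations multiplies the input length by 5 but the attention pairs by 25, which is the growth the statement describes as time-consuming for discrete demonstrations.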

Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning

Chen, Li, Zhang et al. 2022

Preprint


Prompt learning approaches have made waves in natural language processing by inducing better few-shot performance while they still follow a parametric-based learning paradigm; the oblivion and rote memorization problems in learning may encounter unstable generalization issues. Specifically, vanilla prompt learning may struggle to utilize atypical instances by rote during fully-supervised training or overfit shallow patterns with low-shot data. To alleviate such limitations, we develop RETROPROMPT with the motivation of decoupling knowledge from memorization to help the model strike a balance between generalization and memorization. In contrast with vanilla prompt learning, RETROPROMPT constructs an open-book knowledge-store from training instances and implements a retrieval mechanism during the process of input, training and inference, thus equipping the model with the ability to retrieve related contexts from the training corpus as cues for enhancement. Extensive experiments demonstrate that RETROPROMPT can obtain better performance in both few-shot and zero-shot settings. Besides, we further illustrate that our proposed RETROPROMPT can yield better generalization abilities with new datasets. Detailed analysis of memorization indeed reveals RETROPROMPT can reduce the reliance of language models on memorization; thus, improving generalization for downstream tasks.

