2020
DOI: 10.48550/arxiv.2009.01325
Preprint

Learning to summarize from human feedback

Abstract: As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about: summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, hig…

Cited by 46 publications (103 citation statements)
References 48 publications
“…We therefore collected examples of humans using the browser to answer questions, which we call demonstrations. However, training on demonstrations alone does not directly optimize answer quality, and is unlikely to lead far beyond human performance [Stiennon et al., 2020]. We therefore collected pairs of model-generated answers to the same question, and asked humans which one they preferred, which we call comparisons.…”
Section: Data Collection
confidence: 99%
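
The comparison data described in the quote above is simply a set of pairwise preference judgments. Below is a minimal sketch of how one such record might be represented; the class and field names are illustrative assumptions, not the dataset's actual schema.

```python
# Minimal sketch of a pairwise comparison record (illustrative field names,
# not the actual dataset schema).
from dataclasses import dataclass

@dataclass
class Comparison:
    question: str    # prompt shown to the model and the labeler
    answer_a: str    # first model-generated answer
    answer_b: str    # second model-generated answer
    preferred: str   # "a" or "b": the labeler's choice

# Example of a single labeled comparison.
example = Comparison(
    question="What causes ocean tides?",
    answer_a="Tides are driven mainly by the Moon's gravitational pull.",
    answer_b="Tides are caused by changing wind patterns.",
    preferred="a",
)
```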
“…Starting from the BC model with the final unembedding layer removed, we trained a model to take in a question and an answer with references, and output a scalar reward. Following Stiennon et al. [2020], the reward represents an Elo score, scaled such that the difference between two scores represents the logit of the probability that one will be preferred to the other by the human labelers. The reward model is trained using a cross-entropy loss, with the comparisons as labels.…”
Section: Training
confidence: 99%
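
The score-difference-as-logit formulation in the quote above corresponds to a standard pairwise cross-entropy (Bradley-Terry style) loss. Below is a minimal sketch of that loss, assuming a reward model that already produces one scalar per (question, answer) pair; the function name and the use of PyTorch are illustrative assumptions, not the cited paper's code.

```python
# Pairwise cross-entropy loss over reward differences: the difference of the
# two scalar rewards is treated as the logit of P(preferred beats rejected).
import torch
import torch.nn.functional as F

def preference_loss(reward_preferred: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    logits = reward_preferred - reward_rejected      # Elo-style score difference
    return -F.logsigmoid(logits).mean()              # -log P(preferred > rejected)

# Toy usage: a batch of two comparisons, each reduced to scalar rewards.
loss = preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
```

Minimizing this loss pushes the preferred answer's reward above the rejected one's, which is the cross-entropy on the comparison labels described in the quote.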
“…The generated texts are regarded as augmentation data to help improve the classification performance. Moreover, Stiennon et al. [2020] use an RL-based approach on the task of English summarization, which fine-tunes the PLM by incorporating human feedback.…”
Section: Fine-tuning
confidence: 99%
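
As context for what fine-tuning a PLM with human feedback involves in this line of work, the policy is typically optimized with RL against the learned reward, with a KL penalty keeping it close to the supervised reference model. The sketch below shows that per-sample objective; the function and symbol names, and the coefficient value, are illustrative assumptions.

```python
# Sketch of the RL fine-tuning objective: learned reward minus a KL penalty
# toward the supervised reference policy (coefficient `beta` is illustrative).
import torch

def rl_objective(reward: torch.Tensor,
                 logprob_policy: torch.Tensor,
                 logprob_reference: torch.Tensor,
                 beta: float = 0.05) -> torch.Tensor:
    # Log-ratio serves as a per-sample KL estimate between policy and reference.
    kl_estimate = logprob_policy - logprob_reference
    return reward - beta * kl_estimate   # higher is better for the policy
```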
“…Human-in-the-loop can be applied to improve the performance of machine learning models by integrating human knowledge and experience into data analytics [24]. For example, humans can significantly reduce algorithmic bias during training and inference by providing feedback on various natural language processing (NLP) tasks such as text classification [27], syntactic and semantic parsing [28], topic modeling [29], text summarization [30], and sentiment analysis [31]. The general framework is shown in Figure 3.…”
Section: B. Human-in-the-Loop (HITL)
confidence: 99%