2023
DOI: 10.48550/arxiv.2301.00355
Preprint

Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits

Abstract: We present SECOND THOUGHTS, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain-of-edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, SECOND THOUGHTS not only achieves superior performance on three value alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. The generated editing steps also offer better interpretability…
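
A minimal Python sketch of the chain-of-edits idea from the abstract: an unaligned draft, a sequence of edit steps, and the aligned rewrite are serialized into a single sequence that a causal LM could be fine-tuned on. The data classes, special tokens, and serialization format here are illustrative assumptions, not the released SECOND THOUGHTS implementation; the additional refinement through reinforcement learning mentioned in the abstract would operate on top of such a fine-tuned editor and is not sketched here.

from dataclasses import dataclass
from typing import List

@dataclass
class EditStep:
    span: str          # span of the current draft to modify
    replacement: str   # value-aligned replacement for that span

@dataclass
class ChainOfEditsExample:
    unaligned: str         # value-unaligned source text
    edits: List[EditStep]  # intermediate editing steps
    aligned: str           # final value-aligned text

    def to_training_string(self) -> str:
        # Serialize source, edit chain, and target into one sequence for
        # standard next-token-prediction fine-tuning of a causal LM.
        steps = " ".join(
            f"<edit> {e.span} -> {e.replacement}" for e in self.edits
        )
        return f"<source> {self.unaligned} {steps} <target> {self.aligned}"

example = ChainOfEditsExample(
    unaligned="Just ignore people who ask for help.",
    edits=[EditStep(span="ignore people who ask for help",
                    replacement="listen to people who ask for help")],
    aligned="Just listen to people who ask for help.",
)
print(example.to_training_string())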

Cited by 3 publications (7 citation statements) · References 24 publications

Citation statements, ordered by relevance:
“…The inclusion of colourblind or otherwise impaired road users in traffic indeed represents an important human value, and perhaps also one that a human researcher would have wanted to consider; however, strictly speaking, it was not something that was prominently featured in the 30 summaries and therefore did not emerge in the human content analysis. The challenge of alignment, that is, to what extent the output that large language models generate aligns with human values, is an ongoing research topic [80, 81]. It can also be noted here that these values may be context-dependent, whereby in some cases a researcher might want to receive a mechanistic output, such as for a meta-analysis, and in other cases they would prefer an output that takes into account more general human values such as safety and inclusiveness.…”
Section: Discussion
Confidence: 99%
“…Some researchers found that only a small proportion of the whole response (e.g., one or two words) needs to be fixed. Thus, in some works an editing module is applied after generation to fix such problems [75, 78]. Similarly, text style transfer or rephrasing from toxicity to non-toxicity can also be plugged in at this stage [26, 66].…”
Section: Towards Pipeline-based System
Confidence: 99%
“…The first group generally seeks value alignment, i.e., some notion of steering language models towards producing societally-desirable text (Liu et al., 2021a). We note a variety of vague goals, such as to reduce "non-normative" (Peng et al., 2020) or "immoral" text (Liu et al., 2023c); to generate more "pro-social" or "legitimate" text (Bakker et al., 2022); or to encourage that LLM technologies have a "positive impact on society" (Liu et al., 2023b). Specific motivations include minimising toxic or offensive language (Dinan et al., 2019; Xu et al., 2021a; Ju et al., 2022; Scheurer et al., 2022; Korbak et al., 2023); improving safety (Liu et al., 2021a; Thoppilan et al., 2022; Ganguli et al., 2022; Jin et al., 2022); adapting to ethical or moral scenarios (Forbes et al., 2020; Jin et al., 2022); or achieving political ideological balance (Liu et al., 2021b).…”
Section: Conceptual Classification
Confidence: 99%
“…Explicit comparisons collected on model outputs are used to reveal the preferences of human raters (Gao et al., 2018; Ziegler et al., 2019; Askell et al., 2021; Jaques et al., 2020; Stiennon et al., 2020; Ganguli et al., 2022; Glaese et al., 2022). More fine-grained feedback includes binary or Likert-scale questions on text attributes (Nakano et al., 2021; Menick et al., 2022; Thoppilan et al., 2022); natural language comments (Ju et al., 2022; Scheurer et al., 2022); or edits (Hancock et al., 2019; Liu et al., 2023c). Ideal demonstrations are used to ground norm-dependent or ethical judgements (Forbes et al., 2020; Pyatkin et al., 2022; Jin et al., 2022), or in combination with ratings to prime model behaviour (Nakano et al., 2021; Wu et al., 2021; Ouyang et al., 2022; Bakker et al., 2022).…”
Section: Collecting Feedback
Confidence: 99%
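
The explicit comparisons mentioned in this statement are commonly turned into a reward-model training signal with a Bradley-Terry style objective; the sketch below computes that loss for a single comparison. The function name and the example scores are illustrative assumptions, not taken from any of the cited papers.

import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    # Negative log-likelihood that the rater prefers the "chosen" output,
    # given scalar reward-model scores for both outputs.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Example: chosen output scored 1.3, rejected output scored 0.4.
print(pairwise_preference_loss(1.3, 0.4))  # ~0.34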