2022
DOI: 10.48550/arxiv.2212.08073
Preprint

Constitutional AI: Harmlessness from AI Feedback

Cited by 77 publications (80 citation statements); References 0 publications

“…New Directions: Concurrent and future work is beginning to explore two new directions: (a) expanding task diversity even more aggressively with synthetic data generation, particularly in creative and open-ended dialogue (Wang et al, 2022b; Honovich et al, 2022; Ye et al, 2022), and (b) offering human feedback signals on model responses (Ouyang et al, 2022; Glaese et al, 2022; Bai et al, 2022a; Bai et al, 2022b). We view most of these new directions as likely additive to a foundation of instruction tuning methods.…”
Section: Public Instruction Tuning Collections (mentioning)
confidence: 99%
“…Moreover, reinforcement learning from human feedback (RLHF) is then applied to better elicit the LLM's internal knowledge and align it with human values [96,137]. Based on RLHF, Bai et al [7] designed an RL-from-AI-feedback paradigm to obtain a more harmless (and still helpful) language model. Inference Phase.…”
Section: Towards E2E Conversational Model (mentioning)
confidence: 99%
“…As large-scale pre-trained LMs become integrated in more systems, it is a matter of utmost societal importance to make sure that such models adhere to shared human values (Bai et al, 2022; Liu et al, 2021d, 2022). Here, we present a light-weight framework that can align the generation of LMs with such values, without requiring new data or extensive prompt-engineering.…”
Section: Ethics, Broader Impact, and Reproducibility (mentioning)
confidence: 99%