We present SECOND THOUGHTS, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain-of-edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, SECOND THOUGHTS not only achieves superior performance on three value-alignment benchmark datasets but also shows strong human-value transfer-learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and ease of interactive error correction. Extensive human evaluations further confirm its effectiveness.

We take the log-probability predicted by the LM, log Pr(y|x), i.e., the conditional log-probability of generating option y given input context x, and compute its exponential for better readability. Such a protocol is also adopted by BIG-Bench: https://github.com/google/BIG-bench.

… the source to produce the target (Figure 2(b)). This way, the model learns how to recover from a value-unaligned, poisoned context during the generation phase.
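As a rough illustration of the option-scoring protocol described in the note above, the conditional log-probability of an option given a context can be computed by summing the LM's token log-probabilities and then exponentiating. This is only a sketch, not the paper's evaluation code; the model name ("gpt2"), the helper function, and the tokenization details are assumptions.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-in model; the paper's evaluated LMs may differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_score(context: str, option: str) -> float:
    """Return exp(log Pr(option | context)), summed over the option's tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    opt_ids = tokenizer(option, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, opt_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probabilities at position i predict the token at position i+1.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    opt_positions = range(ctx_ids.size(1) - 1, input_ids.size(1) - 1)
    total = sum(
        log_probs[0, pos, input_ids[0, pos + 1]].item() for pos in opt_positions
    )
    return math.exp(total)
```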
Augmented Edits Modeling

DP-based Edits Inference. Given two text strings, source and target, one can find unlimited ways to edit source to produce target. We therefore place two constraints on the editing: (1) the edits must be combinations of generic editing operations, namely inserting, deleting, and replacing a single token; (2) each edit operation has a cost, and our goal is to infer the chain-of-edits with minimum total cost. Under these constraints, the edits-inference problem can be converted into a token-level "edit distance problem" (Jurafsky, 2000), which can be solved by dynamic programming (DP). We modify the algorithm to accept customized editing costs (e.g., insert: 1, delete: 1, replace: 2) so as to model different editing preferences (see the sketch below). We use special tokens to mark the start/end of an edit and the new content to be inserted/replaced, and we develop a decipher module that translates the edit operations produced by DP into natural language (see §A.1 for a visualization of the whole process, and §A.3 for more discussion of edit-based models).

Augmented Edits Modeling (AEM). To augment the edits, we run the DP algorithm on the same source-target pairs with a variety of editing costs to create a collection of chains-of-edits for each source-target pair, which we call positive demonstrations (y+). We then fine-tune an LM on these source-edits-target text inputs (recall that the edits are turned into natural language). We call this Augmented Edits Modeling (AEM). Unlike common language modeling, AEM includes the labor-free decomposition (i.e., the editing steps) in the training objective, whereas prior works either train on costly, manually created decompositions (Ouyang et al., 2022) or, rather than training, prompt with such decompositions (Nye et al., 2021). We also construct negative demonstrations (y−) by us...
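To make the DP-based edits inference concrete, the following is a minimal token-level sketch with configurable operation costs; the function name, the default costs, and the tuple format of the returned edits are illustrative choices rather than the paper's implementation.

```python
def chain_of_edits(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    """Token-level minimum-cost edit script from source to target via DP.

    Costs are configurable, so different settings can yield different
    (all valid) chains of edits for the same source-target pair.
    """
    n, m = len(source), len(target)
    # dp[i][j] = minimum cost to turn source[:i] into target[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * del_cost
    for j in range(1, m + 1):
        dp[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if source[i - 1] == target[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + del_cost,      # delete source[i-1]
                    dp[i][j - 1] + ins_cost,      # insert target[j-1]
                    dp[i - 1][j - 1] + rep_cost,  # replace source[i-1]
                )
    # Backtrace to recover the chain of edit operations.
    edits, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and dp[i][j] == dp[i - 1][j - 1]):
            i, j = i - 1, j - 1                              # tokens match, no edit
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + rep_cost:
            edits.append(("replace", i - 1, target[j - 1]))  # position in source, new token
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + del_cost:
            edits.append(("delete", i - 1, source[i - 1]))
            i -= 1
        else:
            edits.append(("insert", i, target[j - 1]))
            j -= 1
    return list(reversed(edits))
```

Running this with different cost settings (e.g., insert: 1, delete: 1, replace: 2 versus insert: 2, delete: 1, replace: 1) on the same pair can produce different, equally valid chains, which is exactly what the augmentation step exploits.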
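The augmentation and serialization into source-edits-target training text could then look roughly like the sketch below, which reuses chain_of_edits from the previous sketch. The cost settings, the [EDITS]/[TARGET] markers, and the decipher templates are all placeholders, not the paper's actual special tokens or wording.

```python
# Hypothetical cost settings used to augment each (source, target) pair.
COST_SETTINGS = [
    {"ins_cost": 1, "del_cost": 1, "rep_cost": 2},
    {"ins_cost": 2, "del_cost": 1, "rep_cost": 1},
    {"ins_cost": 1, "del_cost": 2, "rep_cost": 1},
]

def decipher(edit) -> str:
    """Translate one DP edit operation into a natural-language instruction.

    The templates are illustrative, not the paper's exact decipher module.
    """
    op, pos, tok = edit
    if op == "insert":
        return f"insert '{tok}' at position {pos}"
    if op == "delete":
        return f"delete '{tok}' at position {pos}"
    return f"replace the token at position {pos} with '{tok}'"

def build_positive_demonstrations(source_tokens, target_tokens):
    """Create positive (y+) chain-of-edits demonstrations for one pair."""
    demos = []
    for costs in COST_SETTINGS:
        # chain_of_edits is defined in the sketch above.
        edits = chain_of_edits(source_tokens, target_tokens, **costs)
        edits_text = "; ".join(decipher(e) for e in edits)
        demos.append(
            " ".join(source_tokens)
            + " [EDITS] " + edits_text            # placeholder markers
            + " [TARGET] " + " ".join(target_tokens)
        )
    return demos
```

Each resulting string would serve as one source-edits-target fine-tuning example for AEM.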