Adversarial Training for High-Stakes Reliability

Ziegler, Daniel M.; Nix, Seraphina; Chan, Lawrence; Bauman, Tim; Schmidt-Nielsen, Peter; Lin, Tong; Scherlis, Adam; Nabeshima, Noa; Weinstein-Raun, Ben; Haas, Daniel de; Shlegeris, Buck; Thomas, Nate

doi:10.48550/arxiv.2205.01663

Cited by 2 publications

(2 citation statements)

References 34 publications

“…Mishkin et al (2022) describes an operational process for doing red-teaming with external experts. Ziegler et al (2022) designed a tool to efficiently assist human adversaries to identify failures in a classifier. Models trained with red-teaming are found to be more robust to adversarial attack (Dinan et al 2019; Ziegler et al 2022) and human-in-the-loop dynamic data collection can efficiently improve model performance (Kiela et al 2021).…”

Section: Related Work (mentioning)

confidence: 99%

A Holistic Approach to Undesired Content Detection in the Real World

Markov, Zhang, Agarwal et al. 2023

AAAI

We present a holistic approach to building a robust and useful natural language classification system for real-world content moderation. The success of such a system relies on a chain of carefully designed and executed steps, including the design of content taxonomies and labeling instructions, data quality control, an active learning pipeline to capture rare events, and a variety of methods to make the model robust and to avoid overfitting. Our moderation system is trained to detect a broad set of categories of undesired content, including sexual content, hateful content, violence, self-harm, and harassment. This approach generalizes to a wide range of different content taxonomies and can be used to create high-quality content classifiers that outperform off-the-shelf models.

“…Some of these efforts augment humans (through guidelines, templates, programmatic generation of attacks, and various combinations thereof) to devise test cases that cause systems to fail [45,46,29,21,30,55,6,23]. Others use humans in the loop to continuously and dynamically build, break, and fix [20] models in order to continuously make them more robust to failure modes [40,32,55,61]. Finally, a large body of work aims to learn adversarial examples that cause downstream models to produce spurious outputs [50], some of which are reviewed in [59].…”

Section: Related Work (mentioning)

confidence: 99%

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Ganguli, Lovitt, Kernion et al. 2022

Preprint

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models. Warning: this paper contains examples that may be offensive or upsetting.
