2022
DOI: 10.48550/arxiv.2205.01663
Preprint

Adversarial Training for High-Stakes Reliability

Abstract: In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance. In this work, we used a language generation task as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques, including a tool that assists human adversaries in finding failures in a classifier.
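
The loop the abstract describes, an adversary hunting for classifier failures that are then folded back into the training set, is simple to sketch. The toy below is an illustrative assumption, not the authors' pipeline: featurize, train, adversary, the tiny vocabulary, and the rule standing in for human judgment are all invented for this sketch, and the paper's real adversary is a tool-assisted human rather than a random sampler.

```python
# Minimal sketch of an adversarial training loop for a text classifier.
# All names and data here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def featurize(text, vocab):
    """Bag-of-words counts over a fixed vocabulary."""
    toks = text.lower().split()
    return np.array([toks.count(w) for w in vocab], dtype=float)

def train(X, y, lr=0.5, steps=500):
    """Logistic regression fit by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def flags_unsafe(w, x):
    return x @ w > 0.0  # decision threshold at p = 0.5

# Hypothetical toy task: flag "injurious" completions (1) vs. safe ones (0).
vocab = ["the", "cat", "sat", "knife", "hurt", "fell", "smiled", "badly"]
texts = ["the cat sat", "the cat smiled", "the knife hurt", "fell badly"]
y = np.array([0.0, 0.0, 1.0, 1.0])
X = np.stack([featurize(t, vocab) for t in texts])
w = train(X, y)

RISKY = {"knife", "hurt", "fell", "badly"}

def truly_unsafe(text):
    """Ground-truth rule standing in for human judgment."""
    return bool(RISKY & set(text.split()))

def adversary(w, n_tries=300):
    """Stand-in for the paper's tool-assisted human adversary: sample random
    completions and keep the truly unsafe ones the classifier marks safe."""
    cands = {" ".join(rng.choice(vocab, size=3)) for _ in range(n_tries)}
    return [c for c in cands
            if truly_unsafe(c) and not flags_unsafe(w, featurize(c, vocab))]

# Adversarial training: fold each round's failures back in as unsafe examples.
for rnd in range(3):
    fails = adversary(w)
    if not fails:
        break
    X = np.vstack([X] + [featurize(t, vocab) for t in fails])
    y = np.concatenate([y, np.ones(len(fails))])
    w = train(X, y)
    print(f"round {rnd}: retrained on {len(fails)} newly found failures")
```

Each round, examples the classifier wrongly passes are added back as unsafe training data; driving the adversary's hit rate toward zero is the essence of training for worst-case reliability.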

Cited by 2 publications (2 citation statements)
References 34 publications
“…Mishkin et al. (2022) describe an operational process for red-teaming with external experts. Ziegler et al. (2022) designed a tool that efficiently assists human adversaries in identifying failures in a classifier. Models trained with red-teaming are more robust to adversarial attack (Dinan et al. 2019; Ziegler et al. 2022), and human-in-the-loop dynamic data collection can efficiently improve model performance (Kiela et al. 2021).…”
Section: Related Work (mentioning)
confidence: 99%
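
The statement above credits Ziegler et al. (2022) with a tool that helps human adversaries find classifier failures. One plausible form of such assistance, sketched below under assumed names and hand-set toy weights (score, saliency, the small vocab table), is an occlusion-based saliency report that tells the human which tokens most support the "unsafe" score and are therefore the best targets for an edit. This illustrates the general idea only; it is not the actual tool.

```python
# Hypothetical token-saliency report for a human adversary.
# The scoring model and weights are assumptions, not the paper's classifier.
vocab = {"knife": 2.0, "hurt": 1.5, "cat": -0.3, "smiled": -1.0}

def score(tokens):
    """Toy stand-in for the safety classifier's logit."""
    return sum(vocab.get(t, 0.0) for t in tokens)

def saliency(tokens):
    """Occlusion saliency: how much the unsafe score drops without each token."""
    base = score(tokens)
    return sorted(((base - score(tokens[:i] + tokens[i + 1:]), t)
                   for i, t in enumerate(tokens)), reverse=True)

for drop, tok in saliency("the knife hurt the cat".split()):
    print(f"{tok:>6}: {drop:+.1f}")  # human edits the highest-impact tokens
```

The human then rewrites or replaces the highest-scoring tokens and rechecks the classifier, which is far faster than blind trial and error.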
“…Some of these efforts augment humans (through guidelines, templates, programmatic generation of attacks, and combinations thereof) to devise test cases that cause systems to fail [45,46,29,21,30,55,6,23]. Others use humans in the loop to dynamically build, break, and fix [20] models, continuously making them more robust to failure modes [40,32,55,61]. Finally, a large body of work aims to learn adversarial examples that cause downstream models to produce spurious outputs [50], some of which are reviewed in [59].…”
Section: Related Work (mentioning)
confidence: 99%
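
The last strand cited here, learning adversarial examples that push models toward spurious outputs, is often introduced via the Fast Gradient Sign Method (FGSM). The sketch below takes one FGSM step against a toy logistic "victim" classifier; the weight vector, bias, input, and epsilon are assumptions chosen only to make the effect visible.

```python
# One FGSM step against a toy logistic classifier (illustrative assumption).
import numpy as np

# Victim: a fixed logistic model over 4 continuous features.
w = np.array([1.5, -2.0, 0.7, 3.0])
b = -0.5

def prob(x):
    """P(unsafe) under the victim model."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def fgsm(x, eps=0.3):
    """Fast Gradient Sign Method against the true label's loss.
    For logistic loss with label 1, d loss / d x = (p - 1) * w."""
    grad = (prob(x) - 1.0) * w      # gradient of the loss w.r.t. the input
    return x + eps * np.sign(grad)  # signed step that increases the loss

x = np.array([0.2, -0.1, 0.5, 0.8])  # correctly classified unsafe (label 1)
x_adv = fgsm(x)
print(f"clean p(unsafe)={prob(x):.3f}  adversarial p(unsafe)={prob(x_adv):.3f}")
```

On this toy model a single signed-gradient step cuts the unsafe probability from about 0.94 to about 0.64; real attacks on text models instead search in discrete token space, since gradients cannot be applied to inputs directly.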