2022
DOI: 10.48550/arXiv.2211.03540
Preprint

Measuring Progress on Scalable Oversight for Large Language Models

Abstract: Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on…

Cited by 4 publications (2 citation statements). References 13 publications.
“…21, 22, 38–41 Human insight and oversight are critical components of the TRIPOD-LLM statement, reflecting an emphasis on components eventually critical for the responsible deployment of LLMs (though deployment reliability and observability are outside the scope of this paper). 42–44 The guidelines include requirements for increased reporting of the expected deployment context and specifying the levels of autonomy assigned to the LLM, if applicable. Furthermore, there is a focus on the quality control processes employed in dataset development and evaluation, such as qualifications of human assessors, requirement for dual annotation, and specific details on instructions provided to assessors to ensure that nuances of text evaluation are captured, thus facilitating reliable assessments of safety and performance.…”
Section: Discussion
confidence: 99%
“…If left unchecked, unintended and undesirable goals, or emergent instrumental goals, such as self-preservation or power-seeking (Turner et al. 2023), could have catastrophic consequences, including human extinction (Cotra 2022). Although various research directions and agendas have been proposed, including debate (Irving, Christiano, and Amodei 2018), scalable oversight (Bowman et al. 2022), iterated distillation and amplification (Christiano, Shlegeris, and Amodei 2018), and reinforcement learning from human feedback (Christiano et al. 2023), the field has not yet converged on an overarching paradigm. Consequently, AI alignment remains an open problem (Amodei et al. 2016; Hendrycks et al. 2022; Ngo, Chan, and Mindermann 2023) that demands further investigation and exploration to foster safe and productive human-AI collaboration.…”
Section: Introduction
confidence: 99%