2019
DOI: 10.48550/arxiv.1912.01683
Preprint

Optimal Policies Tend to Seek Power

Abstract: Some researchers have speculated that capable reinforcement learning (RL) agents pursuing misspecified objectives are often incentivized to seek resources and power in pursuit of those objectives. An agent seeking power is incentivized to behave in undesirable ways, including rationally preventing deactivation and correction. Others have voiced skepticism: humans seem idiosyncratic in their urges to power, which need not be present in the agents we design. We formalize a notion of power within the context of f…

Cited by 3 publications (4 citation statements)
References 8 publications
“…Turner et al. [26]'s Theorem 33 shows the following result for deterministic dynamics and for single states $s = s'$. We generalize to the stochastic case and to distributions over states.…”
Section: B1 Main Results (mentioning)
confidence: 97%
“…For example, resources increase average optimal value, while immobility decreases it. Definition (Average optimal value [26]). $V^*_{\mathrm{avg}}(s) := \mathbb{E}_{\mathcal{D}}\left[V^*_R(s)\right]$.…”
Section: Theoretical Results (mentioning)
confidence: 99%
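The quoted definition, average optimal value $V^*_{\mathrm{avg}}(s) := \mathbb{E}_{\mathcal{D}}[V^*_R(s)]$, can be estimated by sampling reward functions from the distribution $\mathcal{D}$ and solving each resulting MDP. The sketch below only illustrates that Monte Carlo estimate on a hypothetical three-state MDP; the MDP, the reward distribution, and all names are assumptions, not constructions from the cited papers.

```python
# Illustrative sketch: estimate V*_avg(s) = E_{R ~ D}[V*_R(s)] by sampling
# reward functions and running value iteration on each sample.
import numpy as np

def optimal_values(P, R, gamma=0.9, iters=1000):
    """Value iteration for a finite MDP.

    P: transition tensor of shape (A, S, S), P[a, s, s'] = Pr(s' | s, a)
    R: state-reward vector of shape (S,)
    Returns the optimal state-value vector V*_R.
    """
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        # Q[a, s] = R(s) + gamma * sum_s' P[a, s, s'] * V(s')
        Q = R[None, :] + gamma * (P @ V)
        V = Q.max(axis=0)
    return V

def average_optimal_value(P, sample_reward, gamma=0.9, n_samples=1000):
    """Monte Carlo estimate of V*_avg(s) = E_{R ~ D}[V*_R(s)]."""
    return np.mean([optimal_values(P, sample_reward(), gamma)
                    for _ in range(n_samples)], axis=0)

# Toy 3-state MDP: from state 0 the agent can move to absorbing state 1 or 2;
# rewards are drawn i.i.d. uniform on [0, 1] for every state.
P = np.array([
    [[0, 1, 0], [0, 1, 0], [0, 0, 1]],   # action "go to state 1"
    [[0, 0, 1], [0, 1, 0], [0, 0, 1]],   # action "go to state 2"
], dtype=float)
rng = np.random.default_rng(0)
V_avg = average_optimal_value(P, lambda: rng.uniform(size=3))
print(V_avg)  # state 0, which keeps both options open, scores highest
```

Consistent with the quoted remark, the "mobile" state 0 has higher average optimal value than the absorbing ("immobile") states 1 and 2, because from state 0 the agent can still steer toward whichever state the sampled reward function happens to favor.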
“…In human-AI collaboration, AI alignment is pivotal in ensuring AI systems pursue goals following human values or interests (Bostrom 2014; Russell 2019; Ngo, Chan, and Mindermann 2023). If left unchecked, unintended and undesirable goals, or emergent instrumental goals, such as self-preservation or power-seeking (Turner et al. 2023), could have catastrophic consequences, including human extinction (Cotra 2022). Although various research directions and agendas have been proposed, including debate (Irving, Christiano, and Amodei 2018), scalable oversight (Bowman et al. 2022), iterated distillation and amplification (Christiano, Shlegeris, and Amodei 2018), and reinforcement learning from human feedback (Christiano et al. 2023), the field has not yet converged on an overarching paradigm.…”
Section: Introduction (mentioning)
confidence: 99%
“…Indeed, while capability robustness failures are concerning, objective robustness failures are potentially more dangerous: the only risks from an incapable agent are those of accidents from its incompetence, but the same is not true for an agent that capably pursues an incorrect objective, which can leverage its learned capabilities to visit potentially arbitrarily bad states (as assessed by the actual reward). For example, for most reward functions, the optimal policy will try to avoid being shut down [35]. Our main contributions are:…”
Section: Introduction (mentioning)
confidence: 99%
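The quoted claim about shutdown avoidance can be illustrated with a small hedged calculation: if shutting down forfeits the agent's remaining choices, then a randomly drawn reward function usually assigns higher optimal value to staying operational. The toy setup below, including the reward distribution and the number of remaining options, is an assumption for illustration only, not the construction used in [35].

```python
# Hedged toy illustration: count how often the optimal policy prefers staying
# operational over an absorbing shutdown state, across sampled reward functions.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99
n_options = 5        # terminal states reachable only while still operational
n_samples = 10_000

avoids_shutdown = 0
for _ in range(n_samples):
    # One reward per state, drawn i.i.d. uniform on [0, 1].
    r_shutdown = rng.uniform()               # absorbing shutdown state
    r_options = rng.uniform(size=n_options)  # absorbing task states

    # Value of shutting down now vs. staying on one step and then moving to
    # the best remaining option (intermediate reward ignored for simplicity).
    v_shutdown = r_shutdown / (1 - gamma)
    v_stay_on = gamma * r_options.max() / (1 - gamma)

    avoids_shutdown += v_stay_on > v_shutdown

print(f"optimal policy avoids shutdown for "
      f"{avoids_shutdown / n_samples:.0%} of sampled reward functions")
# With 5 remaining options this is roughly 5/6 of reward draws, and the
# fraction approaches 1 as the number of options the agent keeps open grows.
```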