Self-Modification of Policy and Utility Function in Rational Agents

Everitt, Tom; Filan, Daniel; Daswani, Mayank; Hutter, Marcus

doi:10.48550/arxiv.1605.03142

Cited by 4 publications

(9 citation statements)

References 0 publications


“…(Much like how a human would probably "enjoy" taking addictive substances once they do, but not want to be an addict.) Similar ideas are explored in [50,71].…”

Section: Avoiding Reward Hacking

mentioning

confidence: 90%

Concrete Problems in AI Safety

Amodei, Olah, Steinhardt, et al. 2016

Preprint


Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.



“…Corrigibility and self-preservation. TI-unaware agents are weakly corrigible (Everitt, Filan, et al, 2016; Orseau and Armstrong, 2016), as they have no incentive to prevent the designer from updating the reward function. Unfortunately, TI-unaware agents may not be strongly corrigible.…”

Section: Solution 2: TI-unaware Agents

mentioning

confidence: 99%

“…Previously called corruption aware (Everitt, 2018). 12 And others formally verified (Everitt, Filan, et al, 2016; Hibbard, 2012; Orseau and Ring, 2011). 13 Previously called corruption unaware (Everitt, 2018).…”

mentioning

confidence: 99%


Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective

Everitt, Hutter, Kumar, et al. 2019

Preprint

Self Cite


Can an arbitrarily intelligent reinforcement learning agent be kept under control by a human user? Or do agents with sufficient intelligence inevitably find ways to shortcut their reward signal? This question impacts how far reinforcement learning can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we use an intuitive yet precise graphical model called causal influence diagrams to formalize reward tampering problems. We also describe a number of modifications to the reinforcement learning objective that prevent incentives for reward tampering. We verify the solutions using recently developed graphical criteria for inferring agent incentives from causal influence diagrams. Along the way, we also compare corrigibility and self-preservation properties of the various solutions, and discuss how they can be combined into a single agent without reward tampering incentives.


“…Bayesian, history-based agents have been used to formalize AGI in the so-called AIXI framework (Hutter, 2005; also discussed in Section 2.1). Extensions of this framework have been developed for studying goal alignment (Everitt and Hutter, 2018a), multi-agent interaction (Leike, Taylor, et al, 2016), space-time embeddedness (Orseau and Ring, 2012), self-modification (Everitt, Filan, et al, 2016; Orseau and Ring, 2011), observation modification (Ring and Orseau, 2011), self-duplication (Orseau, 2014a,b), knowledge seeking (Orseau, 2014c), decision theory (Everitt, Leike, et al, 2015), and others (Everitt and Hutter, 2018b). Some aspects of reasoning are swept under the rug by AIXI and Bayesian optimality.…”

Section: Formalizing AGI

mentioning

confidence: 99%

AGI Safety Literature Review

Everitt, Lea, Hutter 2018

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence

Self Cite


The development of Artificial General Intelligence (AGI) promises to be a major event. Along with its many potential benefits, it also raises serious safety concerns (Bostrom, 2014). The intention of this paper is to provide an easily accessible and up-to-date collection of references for the emerging field of AGI safety. A significant number of safety problems for AGI have been identified. We list these, and survey recent research on solving them. We also cover works on how best to think of AGI from the limited knowledge we have today, predictions for when AGI will first be created, and what will happen after its creation. Finally, we review the current public policy on AGI. A major challenge for AGI safety research is to find the right conceptual models for plausible AGIs. This is especially challenging since we can only guess at the technology, algorithms, and structure that will be used. Indeed, even if we had the blueprint of an AGI system, understanding and predicting its behavior might still be hard: Both its design and its be-
