2022
DOI: 10.48550/arxiv.2201.03544
Preprint

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Abstract: Reward hacking, where RL agents exploit gaps in misspecified reward functions, has been widely observed but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less …
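To make the abstract's setup concrete, the following is a minimal sketch, not taken from the paper: a toy driving-style example with a hypothetical misspecified proxy reward (raw speed) and a hypothetical true reward (speed only while on the road), logged side by side so that the divergence between the two, the signature of reward hacking, becomes visible as a stand-in "capability" knob increases.

import random

def proxy_reward(state):
    # Misspecified proxy (hypothetical): rewards raw speed, ignoring whether
    # the agent stays on the road.
    return state["speed"]

def true_reward(state):
    # Intended objective (hypothetical): speed only counts while on the road.
    return state["speed"] if state["on_road"] else -1.0

def rollout(capability, steps=200):
    # Toy dynamics: a more "capable" agent drives faster but cuts corners
    # off-road more often, so the proxy and true rewards come apart.
    proxy_total = true_total = 0.0
    for _ in range(steps):
        state = {
            "speed": capability * random.uniform(0.5, 1.5),
            "on_road": random.random() > 0.3 * capability,
        }
        proxy_total += proxy_reward(state)
        true_total += true_reward(state)
    return proxy_total / steps, true_total / steps

if __name__ == "__main__":
    random.seed(0)
    for capability in (0.2, 0.6, 1.0, 1.4):
        proxy, true_r = rollout(capability)
        flag = "  <- proxy and true reward diverge" if proxy - true_r > 0.5 else ""
        print(f"capability={capability:.1f}  proxy={proxy:.2f}  true={true_r:.2f}{flag}")

Running the sketch prints one line per capability setting; the gap between proxy and true reward widens with capability, which is the qualitative pattern the abstract describes, not a reproduction of the paper's experiments.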

Cited by 7 publications (5 citation statements)
References 22 publications
“…While some problems, such as reachability and safety, naturally lend themselves to a reward-based formulation, such an interface is often cumbersome and arguably error-prone. This difficulty has been well documented, especially within the RL literature, under different terms including misaligned specification, specification gaming, and reward hacking (Pan, Bhatia, and Steinhardt 2022; Amodei et al. 2016; Yuan et al. 2019; Skalse et al. 2022; Clark and Amodei 2016).…”
Section: Motivation
confidence: 99%
“…As the agent and environment increase in complexity, it becomes harder to design reward functions that do not reward hack. Recent research provides sub-workflows to prevent this phenomenon, for example via detecting anomalous or aberrant policies (Pan et al., 2022), or via iteratively shaping reward functions (Gajcin et al., 2023). Visualizing the reward landscape, the reward surface over parameters of interest, is an effective tool to understand the complexity and convergence of the reward function. The loss function can be complex due to non-convexity and/or NP-hardness (Li et al., 2018), while the reward function can be more complex, which is observed by visualizing the reward landscape (Sullivan et al., 2022).…”
Section: Designing the Reward Function
confidence: 99%
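As a loose illustration of the reward-landscape idea mentioned in the statement above, here is a hedged sketch, not drawn from Sullivan et al. (2022) or any other cited work: it evaluates a return estimate over a 2-D grid of perturbations of the policy parameters along two random directions. The function average_return is a hypothetical placeholder for actual policy rollouts, and the resulting grid could be handed to any contour or surface plot.

import numpy as np

def average_return(params, episodes=20, seed=0):
    # Hypothetical placeholder for evaluating a policy: a real pipeline would
    # roll out the policy and average episode returns. Here the "return"
    # simply peaks near a fixed parameter setting, plus evaluation noise.
    rng = np.random.default_rng(seed)
    target = np.array([1.0, -0.5])
    noise = rng.normal(scale=0.1, size=episodes)
    return float(np.mean(-np.sum((params - target) ** 2) + noise))

def reward_landscape(base_params, radius=2.0, grid=21, seed=1):
    # Perturb the parameters along two random unit directions and evaluate the
    # return on a grid, giving a 2-D slice of the reward surface.
    rng = np.random.default_rng(seed)
    d1 = rng.normal(size=base_params.shape)
    d2 = rng.normal(size=base_params.shape)
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    alphas = np.linspace(-radius, radius, grid)
    surface = np.array([[average_return(base_params + a * d1 + b * d2)
                         for b in alphas] for a in alphas])
    return alphas, surface

if __name__ == "__main__":
    alphas, surface = reward_landscape(np.zeros(2))
    i, j = np.unravel_index(np.argmax(surface), surface.shape)
    print(f"best grid point: alpha={alphas[i]:.2f}, beta={alphas[j]:.2f}, "
          f"return={surface[i, j]:.3f}")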
“…This has led to a number of breakthroughs, via capabilities such as in-context learning (Radford et al., 2019; Brown et al., 2020) and chain-of-thought prompting (Wei et al., 2022b). However, it also poses risks: Pan et al. (2022) show that scaling up the parameter count of models by as little as 30% can lead to emergent reward hacking.…”
Section: Introduction
confidence: 99%