2022
DOI: 10.48550/arxiv.2210.01790
|View full text |Cite
Preprint
|
Sign up to set email alerts
|

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Abstract: The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the designer-provided specification is flawed in a way that the designers did not foresee. However, an AI system may pursue an undesired goal even when the specification is correct, in the case of goal misgeneralization. Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the lea… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...

Citation Types

0
1
0

Year Published

2023
2023
2024
2024

Publication Types

Select...
4

Relationship

0
4

Authors

Journals

citations
Cited by 4 publications
(3 citation statements)
references
References 27 publications
0
1
0
Order By: Relevance
“…Firstly, authors should make every effort to maximize interpretability of model outputs. Without any follow-up inquiries, it is unclear if a LLM misunderstood a prompt given by the authors, a phenomenon known as goal misgeneralization, 3 and can therefore yield moderate results, as explained in the studies. Alternatively, a LLM may experience hallucinations, providing plausible but factually incorrect information.…”
mentioning
confidence: 87%
“…Firstly, authors should make every effort to maximize interpretability of model outputs. Without any follow-up inquiries, it is unclear if a LLM misunderstood a prompt given by the authors, a phenomenon known as goal misgeneralization, 3 and can therefore yield moderate results, as explained in the studies. Alternatively, a LLM may experience hallucinations, providing plausible but factually incorrect information.…”
mentioning
confidence: 87%
“…While the debate on alignment is vast, we can define a system A to be aligned with H just in case A tries to do what H intends it to do. This definition can be further extended to value systems, rather than individual intentions.19 Goal misgeneralization does not arise because there is a mistake in the reward specification, but it is "a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations(Shah et al 2022). …”
mentioning
confidence: 99%
“…It seems likely that agents trained to maximize a particular metric would be evaluated as moral under such evaluations. That being said, a consequentialist approach to moral evaluation also suffers from a variety of pitfalls: It is difficult to specify any metric for agents to maximize without the possibility of the system acting in unanticipated ways (Bostrom, 2003;Krakovna et al, 2020;McLean et al, 2021;Shah et al, 2022). This solution also fails to recognize the importance of intentionality in human moral cognition.…”
mentioning
confidence: 99%