We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer, a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be, how will that objective differ from the loss function it was trained under, and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and give an overview of topics for future research.
Understanding the inductive bias of neural networks is critical to explaining their ability to generalise. Here, for one of the simplest neural networks (a single-layer perceptron with n input neurons, one output neuron, and no threshold bias term), we prove that upon random initialisation of weights, the a priori probability P(t) that it represents a Boolean function classifying t points in {0, 1}^n as 1 has a remarkably simple form: P(t) = 2^{-n} for 0 ≤ t < 2^n. Since a perceptron can express far fewer Boolean functions with small or large values of t (low "entropy") than with intermediate values of t (high "entropy"), there is, on average, a strong intrinsic a priori bias towards individual functions with low entropy. Furthermore, within a class of functions with fixed t, we often observe a further intrinsic bias towards functions of lower complexity. Finally, we prove that, regardless of the distribution of inputs, the bias towards low entropy becomes monotonically stronger upon adding ReLU layers, and we empirically show that increasing the variance of the bias term has a similar effect.
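As a rough illustration of the claim about P(t), the following Python sketch (our own, not code from the paper) estimates the distribution of t by sampling random bias-free perceptrons. The Gaussian weight initialisation and the convention that a strictly positive activation means class 1 are assumptions made for illustration.

```python
# Monte Carlo sketch: estimate P(t), the probability that a randomly initialised
# bias-free perceptron labels exactly t of the 2^n Boolean inputs as 1.
# Assumptions (illustrative only): Gaussian weights, "w . x > 0" means class 1.
import itertools
from collections import Counter

import numpy as np

n = 4                    # number of input neurons (kept small so all inputs can be enumerated)
num_samples = 100_000    # number of random weight draws

# All 2^n Boolean inputs as rows of a matrix.
X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

counts = Counter()
rng = np.random.default_rng(0)
for _ in range(num_samples):
    w = rng.standard_normal(n)      # random weights, no bias term
    t = int((X @ w > 0).sum())      # number of inputs classified as 1
    counts[t] += 1

for t in range(2 ** n + 1):
    print(f"t = {t:2d}   P(t) ~ {counts[t] / num_samples:.4f}   (2^-n = {2 ** -n:.4f})")
```

Under these assumptions the empirical frequencies should sit close to 2^{-n} for every t below 2^n, which is the uniform-over-t form stated in the abstract.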
It is challenging to design reward functions for complex, real-world tasks. Reward learning lets one instead infer reward functions from data. However, multiple reward functions often fit the data equally well, even in the infinite-data limit. Prior work often treats the reward function as uniquely recoverable by imposing additional assumptions on the data sources. By contrast, we formally characterise the partial identifiability of popular data sources, including demonstrations and trajectory preferences, under multiple common sets of assumptions. We analyse the impact of this partial identifiability on downstream tasks such as policy optimisation, including under changes in environment dynamics. We unify our results in a framework for comparing data sources and downstream tasks by their invariances, with implications for the design and selection of data sources for reward learning.
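One concrete instance of such partial identifiability, used here purely for illustration, is the standard fact that rewards related by potential shaping induce the same optimal policies, so demonstrations of optimal behaviour cannot distinguish them. The toy MDP, potential function, and discount in the Python sketch below are arbitrary choices and are not taken from the paper.

```python
# Toy illustration: two reward functions related by potential shaping,
# R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s),
# induce the same optimal policy, so optimal demonstrations cannot tell them apart.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over s'
R = rng.standard_normal((n_states, n_actions, n_states))          # reward R(s, a, s')
Phi = rng.standard_normal(n_states)                               # arbitrary potential function

# Potential-shaped reward: R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s)
R_shaped = R + gamma * Phi[None, None, :] - Phi[:, None, None]

def optimal_policy(reward, tol=1e-10):
    """Greedy policy from value iteration on (P, reward, gamma)."""
    V = np.zeros(n_states)
    while True:
        Q = (P * (reward + gamma * V[None, None, :])).sum(axis=2)  # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)
        V = V_new

print(optimal_policy(R), optimal_policy(R_shaped))  # identical greedy actions in each state
```

Since the shaped Q-values differ from the originals only by a state-dependent constant, the greedy action in every state is unchanged, which is exactly the kind of invariance the abstract refers to.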
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function R from a policy π. To do this, we need a model of how π relates to R. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. This means that these models are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and we characterise precisely how far the demonstrator's policy may differ from each of the standard models before that model leads to faulty inferences about the reward function R. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models.
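For reference, here is a minimal Python sketch of one of the behavioural models named above, Boltzmann rationality, in which the demonstrator chooses actions with probability proportional to exp(β Q(s, a)). The Q-values and β values below are placeholders for illustration; in IRL, Q would be derived from a candidate reward function.

```python
# Boltzmann-rational demonstrator model: pi(a | s) proportional to exp(beta * Q[s, a]).
# Larger beta means behaviour closer to optimality; beta -> 0 gives a uniform policy.
import numpy as np

def boltzmann_policy(Q, beta):
    """Return pi(a | s) for a tabular Q of shape (num_states, num_actions)."""
    logits = beta * Q
    logits -= logits.max(axis=1, keepdims=True)       # subtract the row max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

Q = np.array([[1.0, 0.5, 0.0],
              [0.2, 0.9, 0.4]])                       # 2 states, 3 actions (illustrative values)

print(boltzmann_policy(Q, beta=1.0))   # mildly rational: probability spread across actions
print(boltzmann_policy(Q, beta=50.0))  # nearly optimal: probability concentrates on the argmax action
```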