We study offline reinforcement learning (RL) in partially observable Markov decision processes (POMDPs). In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy that may depend on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which renders existing offline RL algorithms inapplicable. To address this challenge, we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which tackles both the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of P3O is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as a minimax estimation problem. Under a partial coverage assumption on the confounded dataset, we prove that P3O achieves an $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To the best of our knowledge, P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
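To make the rate claim concrete, a standard way to state such a guarantee reads as follows; the symbols $J(\pi)$, $\pi^{\ast}$, and $\hat{\pi}$ are introduced here for illustration and are not fixed by the abstract. Writing $J(\pi)$ for the expected cumulative reward of a policy $\pi$, $\pi^{\ast}$ for the optimal policy, and $\hat{\pi}$ for the policy returned by P3O, a bound of this type has the form
\[
\operatorname{SubOpt}(\hat{\pi}) \;=\; J(\pi^{\ast}) - J(\hat{\pi}) \;\le\; \frac{C}{\sqrt{n}},
\]
where $n$ is the number of trajectories in the dataset and $C$ is a problem-dependent constant that would typically absorb quantities such as the function-class complexity, the partial coverage coefficient, and logarithmic factors.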
Contents