Reinforcement Learning in Reward-Mixing MDPs

Kwon, Jeongyeol; Efroni, Yonathan; Caramanis, Constantine; Mannor, Shie

doi:10.48550/arxiv.2110.03743

Cited by 2 publications

(5 citation statements)

References 22 publications

“…Finally, [KECM21a] provides a computationally efficient learning algorithm for a special case of POMDPs which they call reward-mixing MDPs. Specifically, the POMDP is a disjoint union of two MDPs which are identical except for the rewards.…”

Section: Related Work (mentioning)

confidence: 99%
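
As a rough illustration of the structure described in the statement above (two MDPs that are identical except for their rewards), the following is a minimal sketch and not the algorithm or code of [KECM21a]. The class name `RewardMixingMDP`, the toy transition and reward tables, the Bernoulli rewards, and the mixing weight of 0.5 are all assumptions made for this example.

```python
import random


class RewardMixingMDP:
    """Toy reward-mixing MDP: shared transitions, two reward tables, hidden context."""

    def __init__(self, transitions, rewards, mix_weight=0.5, seed=0):
        # transitions[s][a] -> list of (next_state, probability); shared by both MDPs
        # rewards[m][s][a]  -> Bernoulli reward mean under latent context m in {0, 1}
        self.transitions = transitions
        self.rewards = rewards
        self.mix_weight = mix_weight  # probability of drawing context 0 (assumed)
        self.rng = random.Random(seed)
        self.context = None
        self.state = None

    def reset(self, start_state=0):
        # The latent context is drawn once per episode and is never shown to the agent.
        self.context = 0 if self.rng.random() < self.mix_weight else 1
        self.state = start_state
        return self.state

    def step(self, action):
        # Only the reward depends on the hidden context; the dynamics do not.
        mean = self.rewards[self.context][self.state][action]
        reward = 1 if self.rng.random() < mean else 0
        next_states, probs = zip(*self.transitions[self.state][action])
        self.state = self.rng.choices(next_states, weights=probs, k=1)[0]
        return self.state, reward


# Two states, two actions; the two contexts disagree only on rewards.
transitions = {
    0: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(0, 0.5), (1, 0.5)]},
}
rewards = [
    {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},  # context 0
    {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}},  # context 1
]
env = RewardMixingMDP(transitions, rewards)
```

Because the agent sees states and rewards but never `env.context`, the model is a POMDP whose hidden part is fixed for the whole episode, which matches the second statement's remark that the unobserved portion of the state "never changes".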

“…In particular, the convention in the recent theoretical RL literature on learning POMDPs [KAL16, GDB16, ALA16, JKKL20, XCGZ21, KECM21a, KECM21b] is to assume access to an oracle that solves the POMDP planning problem. There are some exceptions, but they require very strong assumptions on the model, which essentially trivialize the planning aspect - for instance, they assume that the state transitions are deterministic [KAL16, JKKL20], in which case (for known model) we always know the hidden state even without receiving any observations, or they assume that the "unobserved" portion of the state has constant size and never changes [KECM21a].…”

Section: Introduction (mentioning)

confidence: 99%

Planning in Observable POMDPs in Quasipolynomial Time

Golowich, Moitra, Rohatgi (2022, preprint)

Partially Observable Markov Decision Processes (POMDPs) are a natural and general model in reinforcement learning that take into account the agent's uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though the problem is known to be computationally hard. Almost all existing planning algorithms either run in exponential time, lack provable performance guarantees, or require placing strong assumptions on the transition dynamics under every possible policy. In this work, we revisit the planning problem and ask: Are there natural and well-motivated assumptions that make planning easy? Our main result is a quasipolynomial-time algorithm for planning in (one-step) observable POMDPs. Specifically, we assume that well-separated distributions on states lead to well-separated distributions on observations, and thus the observations are at least somewhat informative in each step. Crucially, this assumption places no restrictions on the transition dynamics of the POMDP; nevertheless, it implies that near-optimal policies admit quasi-succinct descriptions, which is not true in general (under standard hardness assumptions). Our analysis is based on new quantitative bounds for filter stability, i.e., the rate at which an optimal filter for the latent state forgets its initialization. Furthermore, we prove matching hardness for planning in observable POMDPs under the Exponential Time Hypothesis.

“…[JKKL20, KAL16] proved polynomial-time learning results assuming deterministic transitions of the POMDP. [KECM21a] obtains results for learning latent MDPs which are a mixture of 2 underlying models; since the uncertainty in the system can be modeled by a single-dimensional parameter, their results are computationally efficient.…”

Section: Additional Related Work (mentioning)

confidence: 99%

“…Planning algorithms in RL. The problem of planning in POMDPs (namely, finding the optimal policy when the model is known) has been extensively studied, with various proposed heuristics (e.g., [Mon82, CLZ97, HF00a, Hau00, RG02, TK03, PB04, SV05, PGT06, RPPCD08, SV10, SS12, SYHL13, GHL19, Han98, MKKC99, KMN99, LYX11, AYA18]), and a few provably efficient algorithms [BDRS96,KECM21a,GMR22]. Most closely related to our work is [GMR22], which shows a quasipolynomial-time planning algorithm for observable POMDPs.…”

Section: Additional Related Work (mentioning)

confidence: 99%

“…Nevertheless there is a sizeable literature devoted to overcoming the statistical intractability of the learning problem by restricting to natural subclasses of POMDPs [KAL16, GDB16, ALA16, JKKL20, XCGZ21, KECM21a, KECM21b, LCSJ22]. There are far fewer works attempting to overcome computational intractability, and all make severe restrictions on either the model dynamics [JKKL20,KAL16] or the structure of the uncertainty [BDRS96,KECM21a]. The standard practice is to simply sidestep computational issues by assuming access to strong oracles such as ones that solve Optimistic Planning (given a constrained, non-convex set of POMDPs, find the maximum value achievable by any policy on any POMDP in the set) [JKKL20] or Optimistic Maximum Likelihood Estimation (given a set of action/observation sample trajectories, find a POMDP which obtains maximum value conditioned on approximately maximizing the likelihood of seeing those trajectories) [LCSJ22].…”

Section: Introduction (mentioning)

confidence: 99%

Learning in Observable POMDPs, without Computationally Intractable Oracles

Golowich, Moitra, Rohatgi (2022, preprint)

Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g. deterministic transitions) or assume access to an oracle for solving a hard optimistic planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers.
