We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among $M$ candidates, and an agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model. Previous work [29] established an upper bound for RMMDPs with $M = 2$. In this work, we resolve several open questions that remained for the RMMDP model. For an arbitrary $M \ge 2$, we provide a sample-efficient algorithm, EM$^2$, that outputs an $\epsilon$-optimal policy using $\tilde{O}\!\left(\epsilon^{-2} \cdot S^d A^d \cdot \mathrm{poly}(H, Z)^d\right)$ episodes, where $S$ and $A$ are the numbers of states and actions respectively, $H$ is the time horizon, $Z$ is the support size of the reward distributions, and $d = \min(2M-1, H)$. Our technique is a higher-order extension of the method-of-moments-based approach proposed in [29]; nevertheless, the design and analysis of the EM$^2$ algorithm require several new ideas beyond existing techniques. We also provide a lower bound of $(SA)^{\Omega(\sqrt{M})}/\epsilon^2$ for a general instance of RMMDP, showing that super-polynomial sample complexity in $M$ is necessary.
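Written out in display form, the two guarantees above (using only the notation already defined in the abstract) are:
\[
  \underbrace{\tilde{O}\!\left( \epsilon^{-2} \cdot S^{d} A^{d} \cdot \mathrm{poly}(H, Z)^{d} \right)}_{\text{episodes used by EM}^2 \text{ to output an } \epsilon\text{-optimal policy}},
  \qquad d = \min(2M - 1,\, H),
\]
\[
  \underbrace{\frac{(SA)^{\Omega(\sqrt{M})}}{\epsilon^{2}}}_{\text{lower bound for a general RMMDP instance}}.
\]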