2021
DOI: 10.48550/arxiv.2110.03743
Preprint

Reinforcement Learning in Reward-Mixing MDPs

Abstract: Learning a near-optimal policy in a partially observable system remains an elusive challenge in contemporary reinforcement learning. In this work, we consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP). There, a reward function is drawn from one of multiple possible reward models at the beginning of every episode, but the identity of the chosen reward model is not revealed to the agent. Hence, the latent state space, for which the dynamics are Markovian, is not given to th…
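To make the setting concrete, below is a minimal illustrative sketch (not code from the paper) of the episodic interaction the abstract describes: a single transition kernel is shared across episodes, one of several reward tables is drawn at the start of each episode, and the agent observes states and rewards but never the identity of the sampled reward model. All names (RewardMixingMDP, reset, step) and the toy numbers are assumptions made for illustration.

```python
import numpy as np

class RewardMixingMDP:
    """Toy episodic reward-mixing MDP (illustrative sketch, not the paper's code).

    One shared transition kernel P[s, a, s'] and a list of reward tables R_m[s, a].
    At the start of each episode a reward model m is drawn from the mixing weights
    and kept hidden from the agent for the whole episode.
    """

    def __init__(self, P, rewards, mix, horizon, rng=None):
        self.P = np.asarray(P)                            # shape (S, A, S), rows sum to 1
        self.rewards = [np.asarray(R) for R in rewards]   # each of shape (S, A)
        self.mix = np.asarray(mix)                        # mixing weights over reward models
        self.horizon = horizon
        self.rng = rng or np.random.default_rng(0)

    def reset(self):
        # The latent reward model is resampled every episode and never revealed.
        self._m = self.rng.choice(len(self.rewards), p=self.mix)
        self._t = 0
        self._s = 0                                       # fixed initial state for simplicity
        return self._s

    def step(self, a):
        r = self.rewards[self._m][self._s, a]             # reward depends on the hidden model
        s_next = self.rng.choice(self.P.shape[2], p=self.P[self._s, a])
        self._s, self._t = s_next, self._t + 1
        done = self._t >= self.horizon
        return s_next, r, done                            # the model index m is NOT returned


if __name__ == "__main__":
    # Two states, two actions, two reward models that disagree on which action is good.
    P = [[[0.9, 0.1], [0.1, 0.9]],
         [[0.5, 0.5], [0.8, 0.2]]]
    R0 = [[1.0, 0.0], [0.0, 1.0]]
    R1 = [[0.0, 1.0], [1.0, 0.0]]
    env = RewardMixingMDP(P, [R0, R1], mix=[0.5, 0.5], horizon=5)

    s = env.reset()
    done = False
    while not done:
        a = np.random.randint(2)                          # placeholder (uniformly random) policy
        s, r, done = env.step(a)
        print(s, r)
```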

Cited by 2 publications (5 citation statements)
References 22 publications

Citation statements (ordered by relevance):
“…Finally, [KECM21a] provides a computationally efficient learning algorithm for a special case of POMDPs which they call reward-mixing MDPs. Specifically, the POMDP is a disjoint union of two MDPs which are identical except for the rewards.…”
Section: Related Work
confidence: 99%
“…In particular, the convention in the recent theoretical RL literature on learning POMDPs [KAL16, GDB16, ALA16, JKKL20, XCGZ21, KECM21a, KECM21b] is to assume access to an oracle that solves the POMDP planning problem. There are some exceptions, but they require very strong assumptions on the model, which essentially trivialize the planning aspect -for instance, they assume that the state transitions are deterministic [KAL16,JKKL20], in which case (for known model) we always know the hidden state even without receiving any observations, or they assume that the "unobserved" portion of the state has constant size and never changes [KECM21a].…”
Section: Introduction
confidence: 99%
“…[JKKL20, KAL16] proved polynomial-time learning results assuming deterministic transitions of the POMDP. [KECM21a] obtains results for learning latent MDPs which are a mixture of 2 underlying models; since the uncertainty in the system can be modeled by a single-dimensional parameter, their results are computationally efficient.…”
Section: Additional Related Work
confidence: 99%
“…Planning algorithms in RL. The problem of planning in POMDPs (namely, finding the optimal policy when the model is known) has been extensively studied, with various proposed heuristics (e.g., [Mon82, CLZ97, HF00a, Hau00, RG02, TK03, PB04, SV05, PGT06, RPPCD08, SV10, SS12, SYHL13, GHL19, Han98, MKKC99, KMN99, LYX11, AYA18]), and a few provably efficient algorithms [BDRS96,KECM21a,GMR22]. Most closely related to our work is [GMR22], which shows a quasipolynomial-time planning algorithm for observable POMDPs.…”
Section: Additional Related Work
confidence: 99%