Computation of weighted sums of rewards for concurrent MDPs

Buchholz, Peter; Scheftelowitsch, Dimitri

doi:10.1007/s00186-018-0653-1

Cited by 22 publications

(24 citation statements)

References 32 publications


“…Then, we modify the definition of the set of informative MECs C_I(M_I) in (8) to be (10). In (10), we require the informative MECs in C_I(M_I) to contain at least one informative state-action pair from each set ISA_ij.…”

Section: B Base Case: No Identity-revealing Transitions (mentioning)

confidence: 99%

“…Multi-model MDPs: In the literature, there are several names for the model considered in this paper: hidden model MDPs [6], multi-task reinforcement learning [7], multiple-environment MDPs [8], contextual MDPs [9], multi-scenario MDPs and concurrent MDPs [10], latent MDPs [11], and multi-model MDPs [2]. The authors in [6] model the adaptive management problems in conservation biology and natural resources management using a hidden model MDP.…”

Section: Introduction (mentioning)

confidence: 99%

“…We note here that one key difference between multiple-environment MDPs and POMDPs is that the unobservable state in multiple-environment MDPs, which corresponds to the identity of the ground truth MDP model in the candidate set, does not change with time. Control of multi-model MDPs has been studied in [10] and [2], where the authors develop algorithms to construct a single policy that maximizes a weighted sum of discounted rewards for the candidate MDPs in the finite and infinite horizon, respectively. In the finite-horizon case [2], the authors study both history-dependent and Markovian policies and show that deterministic policies are sufficient.…”

Section: Introduction (mentioning)

confidence: 99%

“…In the finite-horizon case [2], the authors study both history-dependent and Markovian policies and show that deterministic policies are sufficient. In the infinite-horizon case [10], the authors focus on the stationary Markovian policies and show that randomization can be strictly more beneficial. Both problems are shown to be NP-hard and solved via mixed-integer programming.…”

Section: Introduction (mentioning)

confidence: 99%


On the Detection of Markov Decision Processes

Duan, Savas, Yan et al. 2021

Preprint


We study the detection problem for a finite set of Markov decision processes (MDPs) where the MDPs have the same state and action spaces but possibly different probabilistic transition functions. Any one of these MDPs could be the model for some underlying controlled stochastic process, but it is unknown a priori which MDP is the ground truth. We investigate whether it is possible to asymptotically detect the ground truth MDP model perfectly based on a single observed history (state-action sequence). Since the generation of histories depends on the policy adopted to control the MDPs, we discuss the existence and synthesis of policies that allow for perfect detection. We start with the case of two MDPs and establish a necessary and sufficient condition for the existence of policies that lead to perfect detection. Based on this condition, we then develop an algorithm that efficiently (in time polynomial in the size of the MDPs) determines the existence of policies and synthesizes one when they exist. We further extend the results to the more general case where there are more than two MDPs in the candidate set, and we develop a policy synthesis algorithm based on breadth-first search and recursion. We demonstrate the effectiveness of our algorithms through numerical examples.


“…Note that reinforcement learning for latent mixture environments here is different from Markov decision processes (MDPs) for non-stationary environments [24], [25], decentralized partially observable Markov decision process (Dec-POMDP) [26], and multi-model Markov decision processes [27], [28]. For non-stationary environments, both reward functions and state transition distributions are allowed to change with time for a trajectory.…”

Section: Introduction (mentioning)

confidence: 99%