Parameterized MDPs and Reinforcement Learning Problems—A Maximum Entropy Principle-Based Framework

Srivastava, Amber; Salapaka, Srinivasa M.

doi:10.1109/tcyb.2021.3102510

Cited by 6 publications

(12 citation statements)

References 34 publications

“…Remark 1. To ensure that the cumulative cost J^μ_{ζη}(s) is finite for all s ∈ S and that the system reaches the cost-free termination state δ in finite steps, we assume that there exists at least one proper policy μ(a|s) ∈ {0, 1} ∀ a ∈ A, s ∈ S, for all parameter values in ζ and η, under which there is a non-zero probability of reaching the cost-free termination state δ starting from any state s ∈ S (please see [12] for the proof).…”

Section: Problem Formulation (mentioning)

confidence: 99%

“…Further, due to the additional state and action parameters, it is difficult to model para-SDMs directly using the existing frameworks [2,3,4,5] that model SDMs. We have addressed static para-SDMs in [12].…”

Section: Introduction (mentioning)

confidence: 99%

“…One of the main contributions of our earlier work [12] on para-SDMs is to view them as combinatorial optimization problems. This is due to the combinatorially large number of possible sequences of states and actions (also referred to as paths) in the SDM.…”

Section: Introduction (mentioning)

confidence: 99%

“…This viewpoint enables the use of the Maximum Entropy Principle (MEP), from the statistical physics literature [13], in addressing para-SDM problems; in prior literature, MEP has proven successful in addressing a wide variety of combinatorial optimization problems such as the facility location problem [14], protein structure alignment [15], and graph and Markov chain aggregation [16]. In brief, the MEP-based framework proposed in [12] simultaneously determines a distribution over the paths and the parameter values that maximize the associated Shannon entropy [13] of the distribution, while ensuring that the expected cumulative cost of the para-SDM attains a pre-specified value. The framework then employs an iterative scheme (an annealing scheme) and improves upon the distribution over the paths (i.e., the decision policy) as well as the parameter values that correspond to decreasing cumulative cost values of the SDM.…”

Section: Introduction (mentioning)

confidence: 99%
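
The annealing idea in the statement above can be illustrated with a minimal Python sketch. It assumes a hypothetical SDM with only three enumerable paths and fixed costs (the values in `path_costs` are invented for illustration), and it leaves out the parameter optimization that the framework in [12] is described as performing simultaneously. For a prescribed expected cost, the entropy-maximizing distribution over paths is the Gibbs distribution, p(path) proportional to exp(-beta * cost); the inverse temperature `beta` and the doubling schedule are assumptions of this sketch, not the schedule used in [12].

```python
import math

def gibbs_distribution(costs, beta):
    """Maximum-entropy distribution over paths: p_i proportional to exp(-beta * cost_i)."""
    c_min = min(costs)  # shift by the minimum cost for numerical stability
    weights = [math.exp(-beta * (c - c_min)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

def shannon_entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def expected_cost(p, costs):
    """Expected cumulative cost under the path distribution p."""
    return sum(q * c for q, c in zip(p, costs))

# Hypothetical cumulative costs of three paths (illustrative values only).
path_costs = [4.0, 2.5, 3.0]

# Assumed annealing schedule: beta doubles at every step.
schedule = [0.01 * (2.0 ** k) for k in range(12)]

# Each trace entry is (beta, entropy, expected cost).
trace = []
for beta in schedule:
    p = gibbs_distribution(path_costs, beta)
    trace.append((beta, shannon_entropy(p), expected_cost(p, path_costs)))

for beta, entropy, cost in trace:
    print(f"beta={beta:8.3f}  entropy={entropy:.4f}  expected cost={cost:.4f}")
```

At small `beta` the distribution is close to uniform (high entropy); as `beta` grows, the entropy falls and the expected cost approaches the cost of the cheapest path, which mirrors the "decreasing cumulative cost values" described in the quoted passage.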

“…The optimal decision policy of the SDM, which is also time-varying owing to the evolving parameters Υ(t), is evaluated as the fixed point of the recursive Bellman equation satisfied by the underlying cumulative cost function. We build upon the Maximum Entropy Principle (MEP) based framework proposed in [12] that addresses the static para-SDM problems. In particular, this MEP-based framework results in a smooth approximation (also referred to as free-energy) of the cumulative cost, which we exploit as a control-Lyapunov function describing the dynamic para-SDM problems.…”

Section: Introduction (mentioning)

confidence: 99%
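
One way to picture a "smooth approximation of the cumulative cost" obtained as a Bellman fixed point is a soft-min recursion, sketched below in Python. This is a generic log-sum-exp construction on a small hypothetical, deterministic, acyclic SDM, not the exact recursion of [12]; the graph `GRAPH`, the state and action names, and the inverse temperature `beta` are all assumptions made for illustration.

```python
import math

# Hypothetical deterministic SDM: state -> {action: (cost, next_state)}.
# "end" plays the role of a cost-free termination state.
GRAPH = {
    "s0": {"a": (1.0, "s1"), "b": (4.0, "end")},
    "s1": {"a": (2.0, "end"), "b": (0.5, "s2")},
    "s2": {"a": (1.0, "end")},
    "end": {},
}

def soft_min(values, beta):
    """Smooth lower bound on min(values): -(1/beta) * log(sum(exp(-beta * v)))."""
    m = min(values)  # shift by the minimum for numerical stability
    return m - math.log(sum(math.exp(-beta * (v - m)) for v in values)) / beta

def free_energy_values(graph, beta, tol=1e-12, max_sweeps=1000):
    """Iterate the soft-min Bellman recursion until it reaches a fixed point."""
    values = {state: 0.0 for state in graph}
    for _ in range(max_sweeps):
        delta = 0.0
        for state, actions in graph.items():
            if not actions:  # termination state keeps value 0
                continue
            candidates = [cost + values[nxt] for cost, nxt in actions.values()]
            new_value = soft_min(candidates, beta)
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < tol:
            break
    return values

def soft_policy(graph, values, state, beta):
    """Stochastic policy induced by the soft-min: p(a|s) proportional to exp(-beta * q)."""
    q = {a: cost + values[nxt] for a, (cost, nxt) in graph[state].items()}
    q_min = min(q.values())
    weights = {a: math.exp(-beta * (v - q_min)) for a, v in q.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

for beta in (0.5, 5.0, 50.0):
    v = free_energy_values(GRAPH, beta)
    print(beta, v["s0"], soft_policy(GRAPH, v, "s0", beta))
```

Because the soft-min is differentiable in the costs, the resulting value is a smooth function of any parameters the costs depend on, and it approaches the ordinary shortest-path cost from below as `beta` increases.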

Time-Varying Parameters in Sequential Decision Making Problems

Srivastava, Salapaka

2022

Preprint

Self Cite

In this paper we address the class of Sequential Decision Making (SDM) problems that are characterized by time-varying parameters. These parameter dynamics are either pre-specified or manipulable. At any given time instant, the decision policy that governs the sequential decisions, along with all the parameter values, determines the cumulative cost incurred by the underlying SDM. Thus, the objective is to determine the manipulable parameter dynamics as well as the time-varying decision policy such that the associated cost is minimized at each time instant. To this end, we develop a control-theoretic framework that designs the unknown parameter dynamics so that they locate and track the optimal values of the parameters, and that simultaneously determines the time-varying optimal sequential decision policy. Our methodology builds upon a Maximum Entropy Principle (MEP) based framework that addresses the static parameterized SDMs. More precisely, we utilize the resulting smooth approximation (from the above framework) of the cumulative cost as a control Lyapunov function. We show that under the resulting control law the parameters asymptotically track a local optimum, that the control law is Lipschitz continuous and bounded, and that the decision policy of the SDM is optimal for a given set of parameter values. The simulations demonstrate the efficacy of our proposed methodology.

Reinforcement learning method based on sample regularization and adaptive learning rate for AGV path planning

Nie, Zhang, et al.

2025

Neurocomputing

On the Importance of Exploration for Real Life Learned Algorithms

Gracla, Bockelmann, Dekorsy

2022

2022 IEEE 23rd International Workshop on Signal Processing Advances in Wireless Communication (SPAWC)

The quality of data-driven learning algorithms scales significantly with the quality of the data available. One of the most straightforward ways to generate good data is to sample or explore the data source intelligently. Smart sampling can reduce the cost of gaining samples, reduce computation cost in learning, and enable the learning algorithm to adapt to unforeseen events. In this paper, we train three Deep Q-Networks (DQN) with different exploration strategies to solve a problem of puncturing ongoing transmissions for URLLC messages. We demonstrate the efficiency of two adaptive exploration candidates, variance-based and Maximum Entropy-based exploration, compared to the standard, simple ε-greedy exploration approach.
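
To make the contrast between exploration strategies concrete, the Python sketch below compares the action probabilities of ε-greedy selection with those of a Boltzmann (softmax) rule, which is one common entropy-motivated alternative. It is not the exploration scheme evaluated in the paper; the Q-values, `epsilon`, and `temperature` are hypothetical values chosen for illustration.

```python
import math

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under epsilon-greedy: uniform with prob. epsilon, greedy otherwise."""
    n = len(q_values)
    best = max(range(n), key=lambda i: q_values[i])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

def boltzmann_probs(q_values, temperature):
    """Action probabilities under a softmax over Q-values (Boltzmann exploration)."""
    q_max = max(q_values)  # shift by the maximum for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical Q-values: two nearly equivalent actions and one clearly bad action.
q = [1.0, 0.9, -2.0]

eg = epsilon_greedy_probs(q, epsilon=0.1)
bz = boltzmann_probs(q, temperature=0.5)

print("epsilon-greedy:", eg)
print("Boltzmann:     ", bz)
```

ε-greedy gives every non-greedy action the same probability regardless of its estimated value, whereas the softmax rule spreads probability according to the Q-values, so the near-optimal second action is explored far more often than the clearly poor third one.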
